Abstract
With the recent development of deep learning, AI models are now widely used across many domains. They perform well on well-defined tasks such as image classification and text generation, and recent generative models (e.g., BigGAN, GPT-3) also produce impressive results on diverse generation tasks (e.g., photo-realistic image and paragraph generation). As the performance of individual AI models improves, interest is growing in comprehensive tasks such as visual language navigation (VLN), in which an agent follows a language instruction from an egocentric view. However, integrating models for VLN remains difficult because of model complexity, modal heterogeneity, and a shortage of paired data. This study provides a comprehensive survey of VLN, taking a systematic approach to reviewing recent trends. First, we define a taxonomy of the fundamental techniques needed to perform VLN. We then analyze VLN from four perspectives: representation learning, reinforcement learning, components, and evaluation. We investigate the pros and cons of each recently proposed component and methodology. Unlike conventional surveys, this survey categorizes the approaches of major research institutes using the taxonomy defined across these four perspectives. Finally, we discuss current open challenges and conclude by suggesting possible future directions.
References
Abbasnejad E, Teney D, Parvaneh A, Shi J, Hengel AVD (2020) Counterfactual vision and language learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10044–10054
Agarwal R, Schuurmans D, Norouzi M (2020) An optimistic perspective on offline reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 104–114
Alamri H, Hori C, Marks TK, Batra D, ParikhD (2018) Audio visual scene-aware dialog (avsd) track for natural language generation in dstc7. In: DSTC7 at AAAI2019 Workshop 2
Alikhani M, Sharma P, Li S, Soricut R, Stone M (2020) Clue: Cross-modal coherence modeling for caption generation. In: Association for Computational Linguistics (ACL), 2020
Ammanabrolu P, Hausknecht M (2020) Graph constrained reinforcement learning for natural language action spaces. In: International Conference on Learning Representations (ICLR), 2020
Anand A, Belilovsky E, Kastner K, Larochelle H, Courville A (2018) Blindfold baselines for embodied QA. In: NIPS 2018 Visually-Grounded Interaction and Language (ViGilL) Workshop
Anderson P et al (2018) Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3674–3683
Arumugam D, Karamcheti S, Gopalan N, Williams EC, Rhee M, Wong LL, Tellex S (2019) Grounding natural language instructions to semantic goal representations for abstraction and generalization. Auton Robot 43(2):449–468
Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2019) Emergent tool use from multi-agent autocurricula. In: International Conference on Learning Representations, 2019
Banino A et al (2018) Vector-based navigation using grid-like representations in artificial agents. Nature 557(7705):429–433
Banino A et al (2020) Memo: a deep network for flexible combination of episodic memories. In: International Conference on Learning Representations (ICLR), 2020
Batra D et al (2020) Objectnav revisited: on evaluation of embodied agents navigating to objects. CoRR 2020
Bear DM et al (2020) Learning physical graph representations from visual scenes. In: 34th Conference on Neural Information Processing Systems (NeurIPS), 2020
Bertasius G, Torresani L (2020) COBE: contextualized object embeddings from narrated instructional video. In: Neurips 2020
Blukis V, Brukhim N, Bennett A, Knepper RA, Artzi Y (2018) Following high-level navigation instructions on a simulated quadcopter with imitation learning. Robot Sci Syst (RSS)
Blukis V, Terme Y, Niklasson E, Knepper RA, Artzi Y (2019) Learning to map natural language instructions to physical quadcopter control using simulated flight. In: Conference on Robot Learning (CoRL) 2019
Blukis V, Knepper RA, Artzi Y (2020) Few-shot object grounding and map** for natural language robot instruction following. In: 4th Conference on Robot Learning (CoRL 2020)
Brown TB et al (2020) Language models are few-shot learners. In: 34th Conference on Neural Information Processing Systems (NeurIPS), 2020
Bruce J, Sünderhauf N, Mirowski P, Hadsell R, Milford M (2018) Learning deployable navigation policies at kilometer scale from a single traversal. In: Proceedings of The 2nd Conference on Robot Learning. PMLR 87, pp 346–361
Cangea C, Belilovsky E, Liò P, Courville A (2019) VideoNavQA: bridging the gap between visual and embodied question answering. In: BMVC 2019
Cerda-Mardini, P., Araujo, V., & Soto, A. (2020) Translating natural language instructions for behavioral robot navigation with a multi-head attention mechanism. In: ACL 2020 WiNLP workshop
Chang M, Gupta A, Gupta S (2020) Semantic visual navigation by watching youtube videos. In: NeurIPS 2020
Chaplot DS, Gandhi D, Gupta S, Gupta A, Salakhutdinov R (2020a) Learning to explore using active neural slam. In: International Conference on Learning Representations (ICLR), 2020a
Chaplot DS, Gandhi DP, Gupta A, Salakhutdinov RR (2020b) Object goal navigation using goal-oriented semantic exploration. Adv Neural Inf Process Syst 33:4247
Chaplot DS, Jiang H, Gupta S, Gupta A (2020c) Semantic curiosity for active visual learning. In: European Conference on Computer Vision. Springer, Cham, pp 309–326
Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020d) Neural topological slam for visual navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12875–12884
Chen B, Song S, Lipson H, Vondrick C (2019a) Visual hide and seek. In: Artificial Life Conference Proceedings. One Rogers Street, MIT Press, Cambridge, MA
Chen H, Suhr A, Misra D, Snavely N, Artzi Y (2019b) Touchdown: natural language navigation and spatial reasoning in visual street environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12538–12547
Chen B et al (2020a) Robust policies via mid-level visual representations: an experimental study in manipulation and navigation. CoRL 2020
Chen C et al (2020b) Soundspaces: audio-visual navigation in 3d environments. In: Computer Vision–ECCV 2020a: 16th European Conference, vol 16. Springer, pp 17–36
Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020c) Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12655–12663
Chen V, Gupta A, Marino K (2020d) Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning. In: ICLR 2021
Chen Y, Tian Y, He M (2020e) Monocular human pose estimation: a survey of deep learning-based methods. Comput Vision Image Underst 192:102897
Chen W, Gan Z, Li L, Cheng Y, Wang W, Liu J (2021) Meta module network for compositional visual reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 655–664
Chevalier-Boisvert M, Bahdanau D, Lahlou S, Willems L, Saharia C, Nguyen TH, Bengio Y (2019) BabyAI: first steps towards grounded language learning with a human in the loop. In: International Conference on Learning Representations, p 105
Chu, Y. W, Lin, K. Y, Hsu, C. C, Ku, L. W. (2020) Multi-step joint-modality attention network for scene-aware dialogue system. In: DSTC8 collocated with Association for the Advancement of Artificial Intelligence (AAAI) 2020
Colas C, Akakzia A, Oudeyer PY, Chetouani M, Sigaud O (2020a) Language-conditioned goal generation: a new approach to language grounding for RL. In: ICML 2020a Workshop
Colas C, Karch T, Lair N, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PY (2020b) Language as a cognitive tool to imagine goals in curiosity-driven exploration. In: NeurIPS 2020b
Co-Reyes JD et al (2019) Guiding policies with language via meta-learning. In: International Conference on Learning Representations (ICLR), 2019
Crook PA, Poddar S, De A, Shafi S, Whitney D, Geramifard A, Subba R (2019) SIMMC: situated Interactive Multi-Modal Conversational Data Collection And Evaluation Platform. In: ASRU 2019
Das A, Datta S, Gkioxari G, Lee S, Parikh D, Batra D (2018a) Embodied question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–10
Das A, Gkioxari G, Lee S, Parikh D, Batra D (2018b) Neural modular control for embodied question answering. In: Conference on Robot Learning. PMLR, pp 53–62
Das A et al. (2020) Probing emergent semantics in predictive agents via question answering. In: International Conference on Machine Learning (ICML), 2020
Datta S, Sikka K, Roy A, Ahuja K, Parikh D, Divakaran A (2019) Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2601–2610
Dean V, Tulsiani S, Gupta A (2020) See, hear, explore: curiosity via audio-visual association. In: NeurIPS 2020
Deitke M et al (2020) Robothor: an open simulation-to-real embodied ai platform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3164–3174
Deng Z, Narasimhan K, Russakovsky O (2020) Evolving graphical planner: contextual global planning for vision-and-language navigation. In: Neurips2020
Do V, Camburu OM, Akata Z, Lukasiewicz T (2020) e-SNLI-VE-2.0: Corrected Visual-Textual Entailment with Natural Language Explanations. In: IEEE CVPR Workshop on Fair, Data Efficient and Trusted Computer Vision
Du H, Yu X, Zheng L (2020) Learning object relation graph and tentative policy for visual navigation. In: European Conference on Computer Vision. Springer, Cham, pp 19–34
Engelcke M, Kosiorek AR, Parker Jones O, Posner H (2020) GENESIS: generative scene inference and sampling of object-centric latent representations. In: Proceedings of the ICLR, 2020
Eysenbach B, Gupta A, Ibarz J, Levine S (2019) Diversity is all you need: learning skills without a reward function. In: ICLR 2019 Conference 752
Fan A, Jernite Y, Perez E, Grangier D, Weston J, Auli M. (2019) ELI5: long form question answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Fan A et al (2020) Generating interactive worlds with text. Proc AAAI Conf Artif Intell 34(02):1693–1700
Fang K, Toshev A, Fei-Fei L, Savarese S (2019) Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 538–547
Fang Z, Gokhale T, Banerjee P, Baral C, Yang Y (2020) Video2commonsense: generating commonsense descriptions to enrich video captioning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Feng Q, Ablavsky V, Bai Q, Li G, Sclaroff S (2020) Real-time visual object tracking with natural language description. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 700–709
Fried D et al (2018) Speaker-follower models for vision-and-language navigation. In: Advances in Neural Information Processing Systems
Fu S, **ong K, Ge X, Tang S, Chen W, Wu Y (2020a) Quda: natural language queries for visual data analytics. CoRR 2020a
Fu TJ, Wang XE, Peterson MF, Grafton ST, Eckstein MP, Wang WY (2020b) Counterfactual vision-and-language navigation via adversarial path sampler. In: European Conference on Computer Vision. Springer, Cham, pp 71–86
Gabeur V, Sun C, Alahari K, Schmid C (2020) Multi-modal transformer for video retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, vol 16. Springer, Berlin, pp 214–229
Gafni O, Wolf L, Taigman Y (2019) Vid2game: controllable characters extracted from real-world videos. In: ICLR 2020
Gan C, Zhang Y, Wu J, Gong B, Tenenbaum JB (2020) Look, listen, and act: towards audio-visual embodied navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 9701–9707
Gao R, Chen C, Al-Halah Z, Schissler C, Grauman K (2020) Visualechoes: spatial image representation learning through echolocation. In: European Conference on Computer Vision. Springer, Cham, pp 658–676
Garcia-Ceja E, Riegler M, Nordgreen T, Jakobsen P, Oedegaard KJ, Tørresen J (2018) Mental health monitoring with multimodal sensing and machine learning: a survey. Pervasive Mob Comput 51:1–26
Gidaris S, Bursuc A, Komodakis N, Pérez P, Cord M (2020) Learning representations by predicting bags of visual words. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6928–6938
Gordon D, Kembhavi A, Rastegari M, Redmon J, Fox D, Farhadi A (2018) Iqa: visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098
Gordon D, Kadian A, Parikh D, Hoffman J, Batra D (2019) Splitnet: Sim2sim and task2task transfer for embodied visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1022–1031
Goyal, P, Niekum, S, Mooney, RJ (2020) PixL2R: guiding reinforcement learning using natural language by map** pixels to rewards. In: Conference on Robot Learning (CoRL) 2020
Gruslys A et al (2020) The advantage regret-matching actor-critic. CoRR 2020
Guo Y, Cheng Z, Nie L, Liu Y, Wang Y, Kankanhalli M (2019) Quantifying and alleviating the language prior problem in visual question answering. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 75–84
Hao W, Li C, Li X, Carin L, Gao J (2020) Towards learning a generic agent for vision-and-language navigation via pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13137–13146
Harish YVS, Pandya H, Gaud A, Terupally S, Shankar S, Krishna KM (2020) DFVS: deep flow guided scene agnostic image based visual servoing. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 9000–9006
He Z et al (2021) ActionBert: leveraging user actions for semantic understanding of user interfaces. In: AAAI Conference on Artificial Intelligence (AAAI-21) 2021
Heinrich S et al (2020) Crossmodal language grounding in an embodied neurocognitive model. Front Neurorobot. https://doi.org/10.3389/fnbot.2020.00052
Hermann KM, Malinowski M, Mirowski P, Banki-Horvath A, Anderson K, Hadsell R (2020) Learning to follow directions in street view. Proc AAAI Conf Artif Intell 34(07):11773–11781
Hill F, Tieleman O, von Glehn T, Wong N, Merzic H, Clark S (2020) Grounded language learning fast and slow. In: ICLR 2021
Hong R, Liu D, Mo X, He X, Zhang H (2019) Learning to compose and reason with language tree structures for visual grounding. IEEE Trans Pattern Anal Mach Intell 44:684
Hong Y, Rodriguez-Opazo C, Qi Y, Wu Q, Gould S (2020) Language and visual entity relationship graph for agent navigation. In: NeurIPS 2020
Hu H, Yarats D, Gong Q, Tian Y, Lewis M (2019) Hierarchical decision making by generating and following natural language instructions. In: Advances in neural information processing systems, 2019
Hu R, Singh A, Darrell T, Rohrbach M (2020) Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9992–10002
Huang H, Jain V, Mehta H, Ku A, Magalhaes G, Baldridge J, Ie E (2019) Transferable representation learning in vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 7404–7413
Hutsebaut-Buysse M, Mets K, Latré S (2020) Pre-trained word embeddings for goal-conditional transfer learning in reinforcement learning. In: International Conference on Machine Learning (ICML) 2020 Language in Reinforcement Learning (LaReL) Workshop
Ilinykh N, Zarrieß S, Schlangen D (2019) Meetup! a corpus of joint activity dialogues in a visual environment. In: Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue (semdial/LondonLogue)
Jaderberg M, Mnih V, Czarnecki WM, Schaul T, Leibo JZ, Silver D, Kavukcuoglu K (2017) Reinforcement learning with unsupervised auxiliary tasks. In: ICLR 2017
Jain U et al (2019) Two body problem: collaborative visual task completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6689–6699
Jaunet T, Vuillemot R, Wolf C (2020) DRLViz: understanding decisions and memory in deep reinforcement learning. Comput Gr Forum 39(3):49–61
Ji J, Krishna R, Fei-Fei L, Niebles JC(2020) Action genome: actions as compositions of spatio-temporal scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10236–10247
Jia B, Chen Y, Huang S, Zhu Y, Zhu SC (2020) Lemma: a multi-view dataset for learning multi-agent multi-task activities. In: European Conference on Computer Vision. Springer, Cham, pp 767–786
Jiang Y, Gu SS, Murphy KP, Finn C (2019) Language as an abstraction for hierarchical deep reinforcement learning. Adv Neural Inf Process Syst 32:9419–9431
Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020a) In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10267–10276
Jiang M, Luketina J, Nardelli N, Minervini P, Torr PH, Whiteson S, Rocktäschel T (2020b) WordCraft: an environment for benchmarking commonsense agents. In: ICML, 2020b Workshop
Joze HRV, Shaban A, Iuzzolino ML, Koishida K (2020) MMTM: multimodal transfer module for CNN fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13289–13299
Juliani A et al (2019) Obstacle tower: a generalization challenge in vision, control, and planning. In: International Joint Conferences on Artificial Intelligence (IJCAI), 2019
Kadian A et al (2019) Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation. In: IEEE Robotics and Automation Letters (RA-L), 2019
Karch T, Lair N, Colas C, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PY (2020) Language-goal imagination to foster creative exploration in Deep RL. In: ICML 2020 Workshop
Khetarpal K, Ahmed Z, Comanici G, Abel D, Precup D (2020) What can I do here? A theory of affordances in reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 5243–5253
Kipf T et al (2019) Compile: Compositional imitation learning and execution. In: International Conference on Machine Learning. PMLR, pp 3418–3428
Koh JY, Baldridge J, Lee H, Yang Y (2021) Text-to-image generation grounded by fine-grained user attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 237–246
Krantz J, Wijmans E, Majumdar A, Batra D, Lee S (2020) Beyond the nav-graph: vision-and-language navigation in continuous environments. In: European Conference on Computer Vision. Springer, Cham, pp 104–120
Kreutzer J, Riezler S, Lawrence C (2020) Learning from human feedback: Challenges for real-world reinforcement learning in nlp. In: Real-World RL Workshop at NeurIPS, 2020
Ku A, Anderson P, Patel R, Ie E, Baldridge J (2020) Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Kulhánek J, Derner E, De Bruin T, Babuška R (2019) Vision-based navigation using deep reinforcement learning. In: 2019 European Conference on Mobile Robots (ECMR). IEEE, pp 1–8
Landi F, Baraldi L, Corsini M, Cucchiara R (2019) Embodied vision-and-language navigation with dynamic convolutional filters. In: The British Machine Vision Conference (BMVC), 2019
Landi F, Baraldi L, Cornia M, Corsini M, Cucchiara R (2021) Multimodal attention networks for low-level vision-and-language navigation. Comput Vision Image Underst 210:103255
Le H, Chen NF (2020) Multimodal transformer with pointer network for the dstc8 avsd challenge. In: DSTC Workshop at Association for the Advancement of Artificial Intelligence (AAAI), 2020
Le H, Hoi SC (2020) Video-grounded dialogues with pretrained generation language models. Assoc Comput Linguist (ACL). https://doi.org/10.48550/ar**v.2006.15319
Lewis M, Fan A (2018) Generative question answering: learning to answer the whole question. In: International Conference on Learning Representations
Li Y, Košecka J (2020) Learning view and target invariant visual servoing for navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 658–664
Li A, Hu H, Mirowski P, Farajtabar M (2019a) Cross-view policy learning for street navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8100–8109
Li J, Tang S, Wu F, Zhuang Y (2019b) Walking with mind: Mental imagery enhanced embodied qa. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 1211–1219
Li A, Bansal S, Giovanis G, Tolani V, Tomlin C, Chen M (2020a) Generating robust supervision for learning-based visual navigation using hamilton-jacobi reachability. In: Learning for Dynamics and Control. PMLR, pp 500–510
Li D, Yu X, Xu C, Petersson L, Li H (2020b) Transferring cross-domain knowledge for video sign language recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6205–6214
Li J, Wang X, Tang S, Shi H, Wu F, Zhuang Y, Wang WY (2020c) Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12123–12132
Li L, Chen YC, Cheng Y, Gan Z, Yu L, Liu J (2020d) Hero: hierarchical encoder for video+ language omni-representation pre-training. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020d
Li S, Chaplot DS, Tsai YHH, Wu Y, Morency LP, Salakhutdinov R (2020e) Unsupervised domain adaptation for visual navigation. In: Deep Reinforcement Learning Workshop at NeurIPS, 2020e
Li Z, Li Z, Zhang J, Feng Y, Zhou J (2021) Bridging text and video: a universal multimodal transformer for video-audio scene-aware dialog. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) Pnpnet: end-to-end perception and prediction with tracking in the loop. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11553–11562
Lin AS, Wu L, Corona R, Tai K, Huang Q, Mooney RJ (2018) Generating animated videos of human activities from natural language descriptions. Learning 1
Liu A et al (2020a) Spatiotemporal attacks for embodied agents. In: European Conference on Computer Vision. Springer, Cham, pp 122–138
Liu J, Chen W, Cheng Y, Gan Z, Yu L, Yang Y, Liu J (2020b) Violin: a large-scale dataset for video-and-language inference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10900–10910
Liu YT, Li YJ, Wang YCF (2020c) Transforming multi-concept attention into video summarization. In: Proceedings of the Asian Conference on Computer Vision
Loynd R, Fernandez R, Celikyilmaz A, Swaminathan A, Hausknecht M (2020) Working memory graphs. In: International Conference on Machine Learning. PMLR, pp 6404–6414
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems 2019
Lu J, Goswami V, Rohrbach M, Parikh D, Lee S (2020) 12-in-1: Multi-task vision and language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10437–10446
Ma CY, Lu J, Wu Z, AlRegib G, Kira Z, Socher R, **ong C (2019a) Self-monitoring navigation agent via auxiliary progress estimation. In: International Conference on Learning Representations (ICLR), 2019a
Ma CY, Wu Z, AlRegib G, **ong C, Kira Z (2019b) The regretful agent: heuristic-aided navigation through progress estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6732–6740
Madureira B, Schlangen D (2020) An overview of natural language state representation for reinforcement learning. In: ICML 2020 Workshop on Language in Reinforcement Learning (LaReL), vol 4
Majumdar A, Shrivastava A, Lee S, Anderson P, Parikh D, Batra D (2020) Improving vision-and-language navigation with image-text pairs from the web. In: European Conference on Computer Vision. Springer, Cham, pp 259–274
Marasović A, Bhagavatula C, Park JS, Bras RL, Smith NA, Choi Y (2020) Natural language rationales with full-stack visual reasoning: from pixels to semantic frames to commonsense graphs. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings
Martins R, Bersan D, Campos MF, Nascimento ER (2020) Extending maps with semantic and contextual object information for robot navigation: a learning-based framework using visual and depth cues. J Intell Robot Syst 2020:1–15
Mei T, Zhang W, Yao T (2020) Vision and language: from visual perception to content creation. APSIPA Trans Signal Inf Process. https://doi.org/10.1017/ATSIP.2020.10
Miech A, Alayrac JB, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9879–9889
Mirowski P et al (2017) Learning to navigate in complex environments. In: International Conference on Learning Representations (ICLR), 2017
Mirowski P et al (2018) Learning to navigate in cities without a map. Adv Neural Inf Process Syst 31:2419–2430
Mirowski P et al (2019) The streetlearn environment and dataset. CoRR2019
Mogadala A, Kalimuthu M, Klakow D (2021) Trends in integration of vision and language research: A survey of tasks, datasets, and methods. J Artif Intell Res 71:1183
Moghaddam MK, Wu Q, Abbasnejad E, Shi J (2021) Optimistic agent: accurate graph-based value estimation for more successful visual navigation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3733–3742
Moon S et al (2020) Situated and interactive multimodal conversations. In: Proceedings of the 28th International Conference on Computational Linguistics, pp 1103–1121
Morad SD, Mecca R, Poudel RP, Liwicki S, Cipolla R (2021) Embodied visual navigation with automatic curriculum learning in real environments. IEEE Robot Autom Lett 6(2):683–690
Mou X, Sigouin B, Steenstra I, Su H (2020) Multimodal dialogue state tracking by qa approach with data augmentation. In: Association for the Advancement of Artificial Intelligence (AAAI) DSTC8 Workshop
Mshali H, Lemlouma T, Moloney M, Magoni D (2018) A survey on health monitoring systems for health smart homes. Int J Ind Ergon 66:26–56
Nagarajan T, Grauman K (2020) Learning affordance landscapes for interaction exploration in 3D environments. In: NeurIPS 2020
Nagarajan T, Li Y, Feichtenhofer C, Grauman K (2020) Ego-topo: environment affordances from egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 163–172
Narasimhan M, Wijmans E, Chen X, Darrell T, Batra D, Parikh D, Singh A (2020) Seeing the un-scene: Learning amodal semantic maps for room navigation. In: European Conference on Computer Vision. Springer, Cham, pp 513–529
Nguyen K, Daumé III H (2019a) Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Nguyen K, Dey D, Brockett C, Dolan B (2019b) Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12527–12537
Pan X, Zhang T, Ichter B, Faust A, Tan J, Ha S (2020) Zero-shot imitation learning from demonstrations for legged robot visual navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 679–685
Pan J, Chen S, Shou MZ, Liu Y, Shao J, Li H (2021) Actor-context-actor relation network for spatio-temporal action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 464–474
Park SM, Kim YG (2021) Survey and challenges of story generation models-A multimodal perspective with five steps: data embedding, topic modeling, storyline generation, draft story generation, and story evaluation. Inf Fusion 67:41–63
Patel R, Rodriguez-Sanchez R, Konidaris G (2020) On the relationship between structure in natural language and models of sequential decision processes. In: The 1st Workshop on Language in Reinforcement Learning, International Conference on Machine Learning (ICML), 2020
Patro B, Namboodiri VP (2018) Differential attention for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7680–7688
Perez E, Lewis P, Yih WT, Cho K, Kiela D (2020) Unsupervised question decomposition for question answering. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Prabhudesai M, Tung HYF, Javed SA, Sieb M, Harley AW, Fragkiadaki K (2020) Embodied language grounding with 3d visual feature representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2220–2229
Puig X et al (2020) Watch-and-help: a challenge for social perception and human-AI collaboration. In: ICLR2021
Qi W, Mullapudi RT, Gupta S, Ramanan D (2020a) Learning to move with affordance maps. In: International Conference on Learning Representations (ICLR), 2020a
Qi Y, Pan Z, Zhang S, van den Hengel A, Wu Q (2020b) Object-and-action aware model for visual language navigation. In: Computer Vision–ECCV 2020b: 16th European Conference, vol 16. Springer, pp 303–317
Qi Y, Wu Q, Anderson P, Wang X, Wang WY, Shen C, Hengel AVD (2020c) Reverie: remote embodied visual referring expression in real indoor environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9982–9991
Qiu Y, Pal A, Christensen HI (2020) Target driven visual navigation exploiting object relationships. In: CoRL 2020
Raffel C et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21:1–67
Ramakrishnan SK, Al-Halah Z, GraumanK (2020) Occupancy anticipation for efficient exploration and navigation. In: European Conference on Computer Vision. Springer, Cham, pp 400–418
Rao M, Raju A, Dheram P, Bui B, Rastrow A (2020) Speech to semantics: improve asr and nlu jointly via all-neural interfaces. In: Proceedings of INTERSPEECH, 2020
Ren M, Iuzzolino ML, Mozer MC, Zemel RS (2020) Wandering within a world: online contextualized few-shot learning. In: ICML 2020 Workshop LifelongML
Ritter S, Faulkner R, Sartran L, Santoro A, Botvinick M, Raposo D (2020) Rapid task-solving in novel environments. In: ICLR 2021
Rosano M, Furnari A, Gulino L, Farinella GM (2020) A comparison of visual navigation approaches based on localization and reinforcement learning in virtual and real environments. In: VISIGRAPP, pp 628–635
Rosenberger P, Cosgun A, Newbury R, Kwan J, Ortenzi V, Corke P, Grafinger M (2020) Object-independent human-to-robot handovers using real time robotic vision. IEEE Robot Autom Lett 6(1):17–23
Sadhu A, Chen K, Nevatia R (2020) Video object grounding using semantic roles in language description. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10417–10427
Sammani F, Melas-Kyriazi L (2020) Show, edit and tell: a framework for editing image captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4808–4816
Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing—NeurIPS, 2019
Savva M et al (2019) Habitat: a platform for embodied ai research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9339–9347
Sax A, Zhang JO, Emi B, Zamir A, Savarese S, Guibas L, Malik J (2019) Learning to navigate using mid-level visual priors. In: Conference on Robot Learning, 2019
Shah P, Fiser M, Faust A, Kew JC, Hakkani-Tur D (2018) Follownet: robot navigation by following natural language directions with deep reinforcement learning. In: Third Machine Learning in Planning and Control of Robot Motion Workshop at ICRA, 2018
Shah R, Krasheninnikov D, Alexander J, Abbeel P, Dragan A (2019) The implicit preference information in an initial state. In: International Conference on Learning Representations
Shamsian A, Kleinfeld O, Globerson A, Chechik G (2020) Learning object permanence from video. In: European Conference on Computer Vision. Springer, Cham, pp 35–50
Shen WB, Xu D, Zhu Y, Guibas LJ, Fei-Fei L, Savarese S (2019) Situational fusion of visual representation for visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2881–2890
Shridhar M et al (2020) Alfred: a benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749
Shridhar M, Yuan X, Côté MA, Bisk Y, Trischler A, Hausknecht M (2021) ALFWorld: aligning text and embodied environments for interactive learning. In: ICLR2021
Shuster K, Humeau S, Hu H, Bordes A, Weston J (2019) Engaging image captioning via personality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12516–12526
Shuster K, Urbanek J, Dinan E, Szlam A, Weston J (2020) Deploying lifelong open-domain dialogue learning. CoRR 2020
Sigurdsson G et al (2020) Visual grounding in video for unsupervised word translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10850–10859
Silva R, Vasco M, Melo FS, Paiva A, Veloso M (2020) Playing games in the Dark: an approach for cross-modality transfer in reinforcement learning. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020
Singh A et al (2019) Towards vqa models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8317–8326
Siriwardhana S, Weerasekera R, Nanayakkara S (2018) Target driven visual navigation with hybrid asynchronous universal successor representations. In: Deep Reinforcement Learning Workshop, NeurIPS, 2018
Srinivas A, Laskin M, Abbeel P (2020) Curl: Contrastive unsupervised representations for reinforcement learning. In: International Conference on Machine Learning (ICML), 2020
Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2020) Vl-bert: Pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (ICLR), 2020
Suhr A et al (2019) Executing instructions in situated collaborative interactions. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D (2020) Mobilebert: a compact task-agnostic bert for resource-limited devices. In: ACL, 2020
Szlam A et al (2019) Why build an assistant in minecraft? CoRR 2019
Tamari R, Shani C, Hope T, Petruck MR, Abend O, Shahaf D (2020) Language (re) modelling: towards embodied language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Tan H, Bansal M (2020) Vokenization: improving language understanding with contextualized, visual-grounded supervision. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Tan S, Liu H, Guo D, Zhang X, Sun F (2020) Towards embodied scene description. Robot Sci Syst
Thomason J, Gordon D, Bisk Y (2019) Shifting the baseline: single modality performance on visual navigation qa. NAACL 2019:1977–1983
Thomason J, Murray M, Cakmak M, Zettlemoyer L (2020) Vision-and-dialog navigation. In: Conference on Robot Learning. PMLR, pp 394–406
Tsai YHH, Bai S, Liang PP, Kolter JZ, Morency LP, Salakhutdinov R (2019) Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. NIH Public Access, 2019, p 6558
Wang X, Xiong W, Wang H, Wang WY (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 37–53
Wang X et al (2019a) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6629–6638
Wang X, Jain V, Ie E, Wang WY, Kozareva Z, Ravi S (2019b) Natural language grounded multitask navigation. In: ViGIL@ NeurIPS
Wang J, Zhang Y, Kim TK, Gu Y (2020a) Modelling hierarchical structure between dialogue policy and natural language generator with option framework for task-oriented dialogue system. In: ICLR 2021
Wang XE, Jain V, Ie E, Wang WY, Kozareva Z, Ravi S (2020b) Environment-agnostic multitask learning for natural language grounded navigation. In: Computer Vision–ECCV 2020: 16th European Conference, vol 16. Springer, pp 413–430
Wang Y (2021) Survey on deep multi-modal data analytics: collaboration, rivalry, and fusion. ACM Trans Multimed Comput Commun Appl TOMM 17(1s):1–25
Waytowich N, Barton SL, Lawhern V, Warnell G (2019) A narration-based reward shaping approach using grounded natural language commands. In: The Imitation, Intent and Interaction (I3) workshop, ICML 2019
Wijmans E et al (2019) Embodied question answering in photorealistic environments with point cloud perception. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6659–6668
Wijmans E et al (2020) Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. In: ICML 2020 Workshop
Wiles O, Gkioxari G, Szeliski R, Johnson J (2020) Synsin: end-to-end view synthesis from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7467–7477
Wortsman M, Ehsani K, Rastegari M, Farhadi A, Mottaghi R (2019) Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6750–6759
Wu Y, Wu Y, Gkioxari G, Tian Y (2018a) Building generalizable agents with a realistic and rich 3d environment. In: ICLR, 2018
Wu Y, Wu Y, Gkioxari G, Tian Y, Tamar A, Russell S (2018b) Learning a Semantic Prior for Guided Navigation. In: European Conference on Computer Vision (ECCV), 2018
Wu Y, Wu Y, Tamar A, Russell S, Gkioxari G, Tian Y (2019) Bayesian relational memory for semantic visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2769–2779
Wu J, Li G, Han X, Lin L (2020a) Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1283–1291
Wu Q, Manocha D, Wang J, Xu K (2020b) Neonav: improving the generalization of visual navigation via generating next expected observations. Proc AAAI Conf Artif Intell 34(06):10001–10008
Wu SA, Wang RE, Evans JA, Tenenbaum J, Parkes DC, Kleiman-Weiner M (2020c) Too many cooks: coordinating multi-agent collaboration through inverse planning. In: CogSci
Xia F et al (2020) Interactive gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robot Autom Soc 5(2):713
Xiang F et al (2020a) Sapien: A simulated part-based interactive environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11097–11107
Xiang J, Wang XE, Wang WY (2020b) Learning to stop: a simple yet effective approach to urban vision-language navigation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Xie L, Markham A, Trigoni N (2020) SnapNav: learning mapless visual navigation with sparse directional guidance and visual reference. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 1682–1688
Ye J, Batra D, Wijmans E, Das A (2020) Auxiliary tasks speed up learning pointgoal navigation. CoRL 2020
Yu H, Lian X, Zhang H, Xu W (2018) Guided feature transformation (gft): a neural language grounding module for embodied agents. In: Conference on Robot Learning. PMLR, pp 81–98
Yu D et al (2019a) Commonsense and semantic-guided navigation through language in embodied environment. In: ViGIL@ NeurIPS
Yu L, Chen X, Gkioxari G, Bansal M, Berg TL, Batra D (2019b) Multi-target embodied question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6309–6318
Zaheer M et al (2020) Big Bird: transformers for longer sequences. In: NeurIPS
Zeng F, Wang C, Ge SS (2020) A survey on visual navigation for artificial agents with deep reinforcement learning. IEEE Access 8:135426–135442
Zhan X, Pan X, Dai B, Liu Z, Lin D, Loy CC (2020) Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3784–3792
Zhang Y, Hassan M, Neumann H, Black MJ, Tang S (2020) Generating 3d people in scenes without people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6194–6204
Zheng L, Zhu C, Zhang J, Zhao H, Huang H, Niessner M, Xu K (2019) Active scene understanding via online semantic reconstruction. Comput Gr Forum 38(7):103–114
Zhong V, Rocktäschel T, Grefenstette E (2019) RTFM: generalising to novel environment dynamics via reading. In: International Conference on Learning Representations (ICLR), 2020
Zhou L, Small K (2020) Inverse reinforcement learning with natural language goals. CoRR 2020
Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA), pp 3357–3364
Zhu F, Zhu Y, Chang X, Liang X (2020a) Vision-language navigation with self-supervised auxiliary reasoning tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10012–10022
Zhu Y et al (2020b) Dark, beyond deep: a paradigm shift to cognitive ai with humanlike common sense. Engineering 6(3):310–345
Zhu Y, Zhu F, Zhan Z, Lin B, Jiao J, Chang X, Liang X (2020c) Vision-dialog navigation by exploring cross-modal memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10730–10739
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C2012635).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
See Table 8.
About this article
Cite this article
Park, SM., Kim, YG. Visual language navigation: a survey and open challenges. Artif Intell Rev 56, 365–427 (2023). https://doi.org/10.1007/s10462-022-10174-9
DOI: https://doi.org/10.1007/s10462-022-10174-9