1 Introduction

Recent years have seen an explosion of research into graph neural networks (GNNs), with a flurry of new architectures being presented in top-tier machine learning conferences and journals every year [28].

Early approaches to image captioning were retrieval-based, selecting a caption for an image from a pool of existing captions, and yielded good results without the need for deep learning. However, not all images may have an appropriate caption in the pool. If the available captions are generic, they will only describe some aspects of an image and may omit its most important feature. In contrast, template-based captioning [56] uses a predefined caption format and object detection to fill in the blanks. This approach produces consistent captions, but they can be unnatural and clearly machine-generated. Contemporary approaches to image captioning are based on deep learning models. Early work focused on a CNN encoder feeding an RNN-based decoder [57]; more recent deep learning approaches incorporate a wide variety of techniques, including GNNs [39, 58] and Transformers [59, 60]. In this survey, we concentrate on deep learning approaches to image captioning, with particular attention to graph-based methods. Deep learning approaches are typically trained on the COCO [47] or Flickr30k [48] datasets, which contain images each accompanied by five human-generated captions. Closely related to contemporary deep learning-based captioning are the tasks of paragraph captioning and video captioning. Paragraph image captioning is the challenge of generating a multi-sentence description of an image [61, 62], whilst video captioning focuses on describing videos. Readers interested in video captioning are directed to the recent survey [63].

Taxonomies of VQA are usually defined through the lens of the datasets used by the various sub-tasks, such as WikiData [70], IMDb, and [69]. Adjacent to the task of VQA and its sub-tasks is the field of Visual Grounding (sometimes known as Referring Expression). This is the task of identifying the salient regions of an image based on a natural language query. Although the task is closely aligned with VQA, it falls outside the scope of this paper, and we direct readers to [15].

Image retrieval spans multiple tasks, all of which make use of deep learning in contemporary approaches. We follow the taxonomy of Alexander et al. [45] and address the following sub-tasks: 1) text-based image retrieval, where images are returned based on a text query; 2) content-based image retrieval, where images are retrieved based on their similarity to an input image; 3) sketch-based retrieval, where images are retrieved based on their similarity to a sketch; 4) semantic-based retrieval, which returns images based on their perceptual content; and 5) annotation-based retrieval, where images are returned using meta-data annotations. The number of datasets used for image retrieval is vast, and the community has not solidified around a single dataset in the way image captioning has around COCO [47]. This makes accurate comparisons between systems difficult, as the level of challenge posed by each dataset varies. Whilst image retrieval-specific datasets exist [71], there are papers [72,73,74] that make use of image captioning datasets [47, 48], illustrating the wide range of datasets employed for image retrieval.

Understanding the inherent biases in datasets is incredibly important for deep learning researchers and practitioners. As models move beyond research benchmarks and into mainstream use, models trained on biased data will produce biased outputs and may contribute to the proliferation of harmful stereotypes. Within the scope of Vision-Language research, work has been done to discover negative biases in core datasets such as COCO, enabling researchers to mitigate these risks. The authors of [75] comprehensively demonstrated the gender, racial, and Western biases that exist in the COCO [47] dataset, finding that lighter-skinned individuals are \(7.5\times \) more common than darker-skinned individuals and that males are \(2\times \) more common than females. This leads to the concern that image captioning models may come to see the world as predominantly occupied by light-skinned males. Worryingly, [75] also finds racial slurs in the ground truth captions, raising concerns about captioning models producing captions containing this derogatory language.

Hirota et al. [76] continue the work of [75] but focus on VQA datasets. They find evidence of bias in Visual Genome [46] and OK-VQA [51], two widely used VQA datasets. The biases include a reflection of traditional gender stereotypes and a US-centric viewpoint on race and nationality.

In addition to the gender and racial biases in existing Vision-Language datasets, many of the datasets are limited in style. The vast majority contain real-world photographs (typically mined from Flickr), which limits models to understanding photographs only. This limitation is most significant in image captioning and image retrieval, where style is a significant component of the caption or retrieval query. For VQA, the impact is somewhat limited unless the question asked is specifically about the style of the image.

2.2 Fundamental graph theoretical concepts

Undirected graph. We define an undirected graph G to be a tuple of sets V and E, i.e. \(G = (V,E)\). The set V contains n vertices (sometimes referred to as nodes) that are connected by the edges in the set E, i.e. if \(v \in V\) and \(u \in V\) are connected by an edge then \(e_{v,u} \in E\). For an undirected graph, we have that \(e_{v,u} = e_{u,v}\).

Directed graph. A directed graph is a graph where the existence of \(e_{v,u}\) does not imply the existence of \(e_{u,v}\) as well. Let \(\textbf{A}\) be the \(n \times n\) binary adjacency matrix such that \(\textbf{A}_{v,u} = 1\) if \(e_{v,u} \in E\). Then it follows that \(\textbf{A}\) is asymmetric (symmetric) for directed (undirected) graphs. More generally, \(\textbf{A}\) can be a real-valued matrix, where the value of \(\textbf{A}_{v,u}\) can be interpreted as the strength of the connection between v and u.

Neighbourhood. The neighbourhood \(\mathcal {N}(v)\) of a vertex \(v \in V\) is the subset of nodes in V that are connected to v. The neighbour u can be either directly connected to v, i.e. \((v,u) \in E\), or indirectly connected by traversing r edges from v to u. Note that some definitions include v itself as part of the neighbourhood.
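
To make these definitions concrete, the short NumPy sketch below builds a binary or weighted adjacency matrix from an edge list and reads off the 1-hop neighbourhood of a vertex; the helper names are our own, not from any referenced work.

```python
import numpy as np

def adjacency_matrix(n, edges, directed=False, weights=None):
    """Build an n x n adjacency matrix from a list of (v, u) edge tuples.

    For an undirected graph the matrix is symmetrised; optional real-valued
    weights encode the strength of each connection."""
    A = np.zeros((n, n))
    for idx, (v, u) in enumerate(edges):
        w = 1.0 if weights is None else weights[idx]
        A[v, u] = w
        if not directed:
            A[u, v] = w  # e_{v,u} = e_{u,v} for undirected graphs
    return A

def neighbourhood(A, v):
    """Direct (1-hop) neighbours of vertex v, excluding v itself."""
    return [u for u in range(A.shape[0]) if u != v and A[v, u] != 0]

# Example: a small undirected graph with 4 vertices
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
print(neighbourhood(A, 1))  # [0, 2]
```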

Complete graph. A complete graph is one (directed or undirected) where for each vertex, there is an edge connecting it to every other vertex in the set V. A complete graph is therefore a graph with the maximum number of edges for a given number of nodes.

Multi-partite graph. A multi-partite graph (also known as a K-partite graph) is a graph where the nodes can be separated into K different sets. For scene understanding tasks, this allows for a graph representation where one set of nodes represents objects and another represents relationships between objects.

Multi-modal graph. A multi-modal graph is one with nodes that have features from different modalities. This approach is commonly used in VQA where the image and text modalities are mixed. Multi-modal graphs enable visual features to coexist in a graph with word embeddings.

2.3 Common graph types in 2D vision-language tasks

This section organises the various graph types used across all three tasks discussed in the survey. Some graphs, such as the semantic and spatial graphs, are used across all tasks [39, 55, 73], whilst others are more domain specific, like the knowledge graph [53, 77]. Figure 2 shows a sample image from the COCO dataset [47] together with various types of graphs that can be used to describe it. This section, alongside the figure, is organised so that graphs representing a single image and graphs representing portions of the dataset are grouped together.

Semantic graph, multi-partite semantic graph, and textual semantic graph. Sometimes referred to as a scene graph, a semantic graph (shown in Fig. 2b) is one that encapsulates the semantic relationships between visual objects within a scene. Across the literature, the terms ‘semantic graph’ and ‘scene graph’ are used somewhat interchangeably, depending on the paper. However, in this survey we use the term ‘semantic graph’ because there are many ways to describe a visual scene as a graph, whereas the ‘semantic graph’ label is more precise about what the graph represents. Semantic graphs come in different flavours. One approach is to define a directed graph with nodes representing visual objects extracted by an object detector such as Faster-RCNN [78] and edges representing semantic relationships between them. This is the approach of Yao et al. [39], where, using a dataset such as Visual Genome [46], a model predicts the semantic relationships that form the edges of the graph. Alternatively, the semantic graph can be seen as a multi-partite graph [58, 79,80,81] (shown in Fig. 2c), where attribute nodes describe the object nodes they are linked to. These works also change the way relationships are represented by using nodes rather than edge features. This yields a semantic graph with three node types: visual object, object attribute, and inter-object relationship. This definition follows that of the ‘scene graph’ defined by Johnson et al. [38]. Finally, another form of semantic graph exists, the textual semantic graph [58, 82] (shown in Fig. 2d). Unlike visual semantic graphs, textual ones are not generated from the image itself but rather from its caption. Specifically, the caption is parsed with the Stanford Dependency Parser [83], a widely used [84, 85] probabilistic sentence parser. Given a caption, the parser returns its grammatical structure, identifying components such as nouns, verbs, and adjectives and marking the relationships between them. This is then modified from a tree into a graph, following the techniques outlined in [86].

Spatial graph. Yao et al. [39] define a spatial graph (Fig. 2e) as one representing the spatial relationships between objects. Visual objects detected by an object detector form nodes, and the edges between the nodes represent one of 11 predefined spatial relationships that may occur between the two objects. These include inside (labelled ‘1’), cover (labelled ‘2’), overlap (labelled ‘3’), and eight positional relationships (labelled ‘4’–‘11’) based on the angle between the centroids of the two objects. These graphs are directed but not always complete, as there are cases where two objects have a weak spatial relationship and are therefore not connected by an edge in the spatial graph. Guo et al. [80] define a graph of a similar nature known as a geometry graph. It is defined as an undirected graph that encodes relative spatial positions between objects whose overlap and relative distance meet certain thresholds.
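
As an illustration of how such spatial edge labels might be computed, the following sketch assigns a label to an ordered pair of bounding boxes; the containment and IoU thresholds and the mapping of angles to the labels ‘4’–‘11’ are our own assumptions rather than the exact rules of [39].

```python
import math

def spatial_relationship(box_a, box_b, iou_threshold=0.5):
    """Illustrative sketch of spatial edge labelling in the spirit of Yao et al.'s
    spatial graph: labels '1'-'3' for inside/cover/overlap and '4'-'11' for eight
    directional bins based on the angle between box centroids. Thresholds and
    label assignments are assumptions. Boxes are (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if bx1 <= ax1 and by1 <= ay1 and bx2 >= ax2 and by2 >= ay2:
        return "1"  # 'inside': box_a lies entirely within box_b
    if ax1 <= bx1 and ay1 <= by1 and ax2 >= bx2 and ay2 >= by2:
        return "2"  # 'cover': box_a entirely contains box_b
    inter = max(0, min(ax2, bx2) - max(ax1, bx1)) * max(0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    if inter / union > iou_threshold:
        return "3"  # 'overlap'
    # Otherwise bin the centroid-to-centroid angle into one of eight 45-degree sectors.
    ca = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    cb = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    angle = math.degrees(math.atan2(cb[1] - ca[1], cb[0] - ca[0])) % 360
    return str(4 + int(angle // 45))  # positional labels '4'-'11'
```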

Hierarchical spatial (Tree). These graphs build on the spatial graph, but the relationships between nodes focus on the hierarchical nature of the spatial relationships between the detected objects within an image. Yao et al. [87] propose to use a tree (i.e. a graph where each pair of nodes is connected by a single path) to define a hierarchical image representation. An image (\(\mathcal {I}\)) is first divided into regions using Faster-RCNN [78] (\(\mathcal {R} = \{r_i\}^K_{i=1}\)), with each region being further divided into instance segmentations (\(\mathcal {M} = \{m_i\}^K_{i=1}\)). This gives a three-layer tree structure (\(\mathcal {T} = (\mathcal {I}, \mathcal {R}, \mathcal {M}, \mathcal {E}_{tree})\), where \(\mathcal {E}_{tree}\) is the set of connecting edges) to represent the image, as shown in Fig. 2f. He et al. [60] use a hierarchical spatial graph, with edges representing ‘parent’, ‘child’, and ‘neighbour’ relationships depending on the intersection over union of the bounding boxes.

Similarity graph. The similarity graph (Fig. 2g) proposed by Kan et al. [88] (referred to as a semantic graph by the authors) is generated by computing the dot product between pairs of visual features extracted by Faster-RCNN [78]. The dot products form the values of an adjacency matrix \(\textbf{A}\), as the operation captures the similarity between two vectors: the higher the dot product, the closer the two vectors. Faster-RCNN extracts a set of n visual features, where each feature x(v) is associated with a node v and the value of the edge between two nodes v and u is given by \(\textbf{A}_{u,v} = \sigma \left( x(v)^T \textbf{M} x(u)\right) \), where \(\sigma (\cdot )\) is a nonlinear function and \(\textbf{M}\) is a learnt weight matrix. The authors of [88] suggest that generating the graph this way allows relationships between objects to be discovered in a data-driven manner, rather than relying on a model trained on a dataset such as Visual Genome [46].
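
A minimal PyTorch sketch of this construction is shown below; the feature dimension, the use of a sigmoid for \(\sigma \), and the module name are illustrative assumptions rather than details taken from [88].

```python
import torch
import torch.nn as nn

class SimilarityGraph(nn.Module):
    """Builds the learned adjacency A_{u,v} = sigma(x(v)^T M x(u)) over a set
    of region features, in the spirit of the similarity graph of Kan et al. [88].
    Dimensions and the choice of sigmoid are illustrative assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.M = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.M)

    def forward(self, x):             # x: (n, dim) region features from an object detector
        scores = x @ self.M @ x.t()   # (n, n) bilinear similarities
        return torch.sigmoid(scores)  # nonlinearity sigma gives edge weights in (0, 1)

# n = 36 region features of dimension 2048 (typical Faster-RCNN output size)
features = torch.randn(36, 2048)
A = SimilarityGraph(2048)(features)   # (36, 36) learned adjacency matrix
```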

Image graph/K-nearest neighbour graph. In their 2021 image captioning work, Dong et al. [89] construct an image graph by mapping images into a latent feature space, averaging the object vectors output by feeding each image into Faster-RCNN [78]. The K closest images from the training data or search space in terms of \(l_2\) distance are then turned into an undirected complete graph, shown in Fig. 2h. A similar approach is used by Liu et al. [90] with their K-nearest neighbour graph.
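
The sketch below illustrates this kind of construction under our own simplifying assumptions (mean-pooled 2048-d detector features, a fully connected graph over the query image and its K neighbours).

```python
import torch

def knn_image_graph(query_feat, gallery_feats, k=5):
    """Builds an undirected, complete graph over the query image and its K
    nearest images in l2 distance, roughly following the image graph of [89].
    Each image is represented by the mean of its object feature vectors."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (N,)
    knn_idx = torch.topk(dists, k, largest=False).indices                   # K closest images
    nodes = torch.cat([query_feat.unsqueeze(0), gallery_feats[knn_idx]], dim=0)
    n = nodes.shape[0]
    adj = torch.ones(n, n) - torch.eye(n)  # complete graph, no self-loops
    return nodes, adj

# Mean-pooled object features: one 2048-d vector per image (illustrative sizes)
query = torch.randn(2048)
gallery = torch.randn(1000, 2048)
nodes, adj = knn_image_graph(query, gallery, k=5)
```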

Topic graph. Proposed by Kan et al. [88], the topic graph is an undirected graph of nodes representing topics extracted by GPU-DMM [91]. Topics are latent features representing shared knowledge across the entire caption set. Modelling them as a graph, as shown in Fig. 2i, with edges computed by taking the dot product of the two nodes, allows the modelling of knowledge represented in the captions.

Region adjacency graph. A region adjacency graph (RAG) is a graph made up of nodes representing homogeneous segments of the image, with edges representing the connection of adjacent regions. There are different approaches to defining the regions, but patches or superpixels are commonly used. Patches, small equal divisions of the image, are used by Sui et al. [92] whilst superpixels, an unsupervised segmentation approach for clustering nearby pixels, are used by [93].

Knowledge graph. A knowledge graph, or fact graph, is a graph-based representation of information. Whilst there is no agreed structure for these graphs [94], they typically take the form of triplets. They are used in a wide variety of tasks to provide the information needed to ‘reason’. Hence, knowledge graphs enable the fact-based VQA (FVQA) task.

Fig. 2 A visual comparison of the various graph types used across vision-language tasks. Best viewed in colour

3 An overview of graph neural networks

Over the past few years, a large number of GNN architectures have been introduced in the literature. Wu et al. [95] proposed a taxonomy containing four distinct groups: recurrent GNNs, convolutional GNNs, autoencoder GNNs, and spatial–temporal GNNs. The applications discussed in this paper mostly utilise convolutional GNNs; for a comprehensive overview of other architectures, readers are directed to [95]. GNNs, especially traditional architectures such as the graph convolutional network, have a deep grounding in relational inductive biases [41]. They are built on the assumption of homophily, i.e. that connected nodes are similar. There is an increasing body of work looking into addressing some of the bottlenecks that GNNs may suffer from. Novel training strategies such as [96] have been shown to reduce GPU memory usage, whilst approaches such as [97] reduce the difference in performance when dealing with homophilic or heterophilic graphs.

3.1 Graph convolutional networks

One common convolutional GNN architecture is the message passing neural network (MPNN) proposed by Gilmer et al. Although this architecture has been shown to be limited [98], it forms a good abstraction of GNNs.

Gilmer et al. describe MPNNs as being composed of a message function, an update function, and a readout function. These functions vary depending on the application of the network, but are learnable, differentiable, and permutation invariant. The message and update functions run for a number of time steps T, passing messages between connected nodes of the graph. The messages are used to update the hidden feature vectors of the nodes, which in turn are used by the readout function.

The messages are defined as

$$\begin{aligned} \textbf{m}^{(t + 1)}_v = \sum _{u \in \mathcal {N}(v)} M_t(\textbf{h}^{(t)}_v, \textbf{h}^{(t)}_u, \textbf{e}_{v,u}) \,, \end{aligned}$$
(1)

where a message for a node at the next time step \(\textbf{m}^{(t + 1)}_v\) is given by combining its current hidden state \(\textbf{h}^{(t)}_v\) with that of its neighbour \(\textbf{h}^{(t)}_u\) and any edge feature \(\textbf{e}_{v,u}\) in a multi-layer perceptron (MLP) \(M_t(\cdot )\). Given that a message is an aggregation over all the connected nodes, the summation acts over the nodes \(u \in \mathcal {N}(v)\), i.e. the neighbourhood of v.

These messages are then used to update the hidden vectors by combining the node's current state with the message in an MLP \(U_t\).

$$\begin{aligned} \textbf{h}^{(t+1)}_v = U_t(\textbf{h}^t_v, \textbf{m}^{(t+1)}_v)\end{aligned}$$
(2)

Once the message passing phase has run for T time steps, a readout phase is then conducted using a readout function, \(R(\cdot )\). This is defined as an MLP that considers the updated feature vectors of nodes (\(\textbf{h}^T_v\)) on the whole graph (\(v \in G\)) to produce a prediction and is defined as:

$$\begin{aligned} \hat{y} = R(\{\textbf{h}^T_v | v \in G\})\end{aligned}$$
(3)
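
The sketch below ties Eqs. 1–3 together in PyTorch for a graph stored as a dense adjacency matrix; the MLP sizes, the sum-then-linear readout, and the class name are our own illustrative choices rather than a specific published implementation.

```python
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    """A minimal sketch of Eqs. 1-3: MLP message, update, and readout functions
    over a dense adjacency matrix. Hidden sizes are illustrative."""

    def __init__(self, dim, edge_dim):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * dim + edge_dim, dim), nn.ReLU())  # M_t
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())              # U_t
        self.readout = nn.Linear(dim, 1)                                             # R

    def forward(self, h, e, adj, T=3):
        # h: (n, dim) node states, e: (n, n, edge_dim) edge features, adj: (n, n) binary
        for _ in range(T):
            n = h.shape[0]
            hv = h.unsqueeze(1).expand(n, n, -1)                    # receiver state h_v
            hu = h.unsqueeze(0).expand(n, n, -1)                    # neighbour state h_u
            pair = self.message(torch.cat([hv, hu, e], dim=-1))     # per-edge messages
            m = (adj.unsqueeze(-1) * pair).sum(dim=1)               # Eq. 1: sum over N(v)
            h = self.update(torch.cat([h, m], dim=-1))              # Eq. 2
        return self.readout(h.sum(dim=0))                           # Eq. 3: one simple choice of R
```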

In order to make the GCN architecture scale to large graphs, the GraphSAGE [99] architecture changes the message function. Rather than taking messages from the entire neighbourhood of a node, a random sample is used. This reduces the number of messages that require processing, resulting in an architecture that works well on large graphs.
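
A minimal sketch of the kind of neighbour sampling GraphSAGE performs is given below; the helper is our own, not the library's API, and the sampled indices would stand in for the full neighbourhood \(\mathcal {N}(v)\) in the message sum.

```python
import torch

def sample_neighbourhood(adj, v, num_samples=10, generator=None):
    """GraphSAGE-style neighbour sampling: rather than aggregating messages
    from every neighbour of v, draw a fixed-size random sample. Returns the
    indices of the sampled neighbours (illustrative sketch)."""
    neighbours = torch.nonzero(adj[v], as_tuple=False).flatten()
    if neighbours.numel() <= num_samples:
        return neighbours
    perm = torch.randperm(neighbours.numel(), generator=generator)[:num_samples]
    return neighbours[perm]
```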

3.2 Gated graph neural networks

The core idea behind the gated graph neural network (GGNN) [100] is to replace the update function from the message passing architecture (Eq. 2) with a gated recurrent unit (GRU) [101]. The GRU is a recurrent neural network with update and reset gates that control which data can flow through the network (and be retained) and which data cannot (and is therefore forgotten).

$$\begin{aligned} \textbf{h}^{(t+1)}_v = GRU\left( \textbf{h}^{(t)}_v, \sum _{u \in \mathcal {N}(v)}\textbf{W}\textbf{h}^{(t)}_u\right) \,. \end{aligned}$$
(4)

where \(\textbf{h}^t_v\) is the hidden feature of node v at time t, \(\textbf{W}\) is a learnt weight matrix, and \(u \in \mathcal {N}(v)\) is the subset of nodes in the graph connected to node v.

The GGNN also replaces the message function from Eq. 1 with a learnable weight matrix. Using the GRU alongside back-propagation through time enables the GGNN to operate on sequential data. However, due to the recurrent nature of the architecture, it can become infeasible in terms of memory to run the GGNN on large graphs.
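
A minimal sketch of the update in Eq. 4, using PyTorch's GRUCell and a dense adjacency matrix, is given below; the layer name and the number of propagation steps are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GGNNLayer(nn.Module):
    """A minimal sketch of the GGNN update in Eq. 4: neighbour states are
    projected by a learnt weight matrix W, summed, and fed to a GRU cell."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(input_size=dim, hidden_size=dim)

    def forward(self, h, adj, steps=3):
        # h: (n, dim) node states, adj: (n, n) adjacency (adj[v, u] = 1 if u in N(v))
        for _ in range(steps):
            msg = adj @ self.W(h)   # sum over N(v) of W h_u
            h = self.gru(msg, h)    # GRU update of h_v given the aggregated message
        return h
```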

3.3 Graph attention networks

Following on from the multi-head attention mechanism of the popular Transformer architecture [40], graph attention networks (GATs) [102] extend the common GCN with an attention mechanism. Using an attention function, typically modelled by an MLP, the architecture calculates an attention weighting between two nodes. This process is repeated K times using K attention heads in parallel, and the resulting head outputs are concatenated or averaged to give the final node representations.

The self-attention is computed by a function \(a(h^t_v, h^t_u)\) (typically an MLP) that attends to the hidden representation of a node (\(h^t_v\)) and one of its neighbours (\(h^t_u\)). Once every node pairing in the graph has had its attention computed, the scores are passed through a softmax function to give a normalised attention coefficient (\(\alpha _{v,u}\)). This process is then extended to multi-head attention by repeating it across K different attention heads, each with different initialisation weights. The final node representation is obtained by concatenating (represented as \(\Vert \)) or averaging the K attention heads.

$$\begin{aligned} \textbf{h}_v^{(t+1)} = \Bigg \Vert ^K_{k=1} \sigma \left( \sum _{u \in \mathcal {N}(v)} \alpha _{v,u}^{(k)} \textbf{W}^{(k)} \textbf{h}_u\right) \end{aligned}$$
(5)

where \(\sigma \) is a nonlinear activation function such as ReLU and \(\textbf{W}\) is a learnable weight matrix.
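
As a concrete reference point, a single-layer PyTorch sketch of Eq. 5 is given below. The additive form of the attention function follows the original GAT paper, whilst the hidden sizes, number of heads, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """A minimal single-layer sketch of Eq. 5: per-head attention scores are
    masked by the adjacency, softmax-normalised into alpha_{v,u}, and the K
    head outputs are concatenated."""

    def __init__(self, dim, out_dim, heads=4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(dim, heads * out_dim, bias=False)     # W^(k) for all heads
        self.a = nn.Parameter(torch.empty(heads, 2 * out_dim))   # additive attention vectors
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        n = h.shape[0]
        z = self.W(h).view(n, self.heads, self.out_dim)                                # (n, K, d)
        scores = torch.einsum('vkd,kd->vk', z, self.a[:, :self.out_dim]).unsqueeze(1) \
               + torch.einsum('ukd,kd->uk', z, self.a[:, self.out_dim:]).unsqueeze(0)  # (n, n, K)
        scores = F.leaky_relu(scores).masked_fill(adj.unsqueeze(-1) == 0, float('-inf'))
        alpha = torch.softmax(scores, dim=1)                       # normalise over neighbours u
        out = torch.einsum('vuk,ukd->vkd', alpha, z)               # weighted sum of W^(k) h_u
        return F.elu(out.reshape(n, self.heads * self.out_dim))    # concatenate the K heads
```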

3.4 Graph memory networks

Recent years have seen the development of graph memory networks, which can conceptually be thought of as models with an internal and an external memory. When multiple graphs overlap the same spatial information, as in [103], the use of some form of external memory allows node updates to be aggregated as the graphs undergo message passing. This essentially allows features from multiple graphs to be combined in a way that goes beyond a simple pooling operation. In the case of Khademi [103], two graphs are constructed over the same image but may have different nodes. These graphs are updated using a GGNN. An external spatial memory is constructed to aggregate information from across the graphs as they are updated, using a neural network with an attention mechanism. The final state of the spatial memory is used to perform the final task.

3.5 Modern graph neural network architectures

In recent years, the limits of message passing GNNs have become increasingly evident, from their tendency to oversmooth the input features as the depth of the network increases [104] to their unsatisfactory performance in heterophilic settings [105], i.e. when neighbouring nodes in the input graphs are dissimilar. Furthermore, the expressive power of GNNs based on the message passing mechanism has been shown to be bounded by that of the well-known Weisfeiler–Lehman isomorphism test [98], meaning that there are inherent limits to their ability to generate different representations for structurally different input graphs.

Motivated by the desire to overcome these issues, researchers have now started looking at alternative models that move away from standard message passing architectures. Efforts in this direction include, among many others, higher-order message passing architectures [106], cell complex networks [107], and networks based on diffusion processes [2, 105, 108]. To the best of our knowledge, the application of these architectures to the 2D image understanding tasks discussed in this paper has not been explored yet. As such, we refer the readers to the referenced papers for detailed information on the respective architectures.

4 Image captioning

Image captioning is the challenging task of producing a natural language description of an image. Beyond being an interesting technical challenge, it presents an opportunity to develop accessibility technologies for severely sight impaired (formerly ‘blind’) and sight impaired (formerly ‘visually impaired’Footnote 2) users. Additionally, it has applications in problems ranging from image indexing [109] to surveillance [88]. There are three forms of image captioning techniques: 1) retrieval-based captioning, where a caption is retrieved from a set of existing captions; 2) template-based captioning, where a pre-existing template is filled in using information extracted from the image; and 3) deep learning-based image captioning, where a neural network is tasked with generating a caption from an input image. We propose to refine this taxonomy to differentiate between GNN-based approaches and more traditional deep learning-powered image captioning. The following section details the GNN-based approaches to image captioning, of which there have been a number in recent years. Figure 3 illustrates the structure of a generic GNN-based image captioning architecture.

Fig. 3 An abstract overview of GNN-based image captioning architectures discussed in this section

Most architectures extract image features and use them to construct at least one graph to represent the image. The edges (in the case of [39, 88]) or nodes (in the case of [80]) that encode relationships map clearly to prepositions, and nodes representing attributes map to adjectives. This strong relationship between the graph structure generated by the encoder and the final sentence output by the decoder further supports the image graph–sentence architecture used by many image captioning systems.

Zhou et al. [81] use an LSTM alongside a Faster-RCNN [78]-based image feature extractor, with the addition of a visual self-attention mechanism. The authors make use of a multi-partite semantic graph, following the style of [38, 80]. Specifically, they propose to use three GCNs to create context-aware feature vectors for each of the object, attribute, and relationship nodes. The resulting context-aware nodes undergo fusion with the self-attention maps, enabling the model to control the granularity of captions. Finally, the authors test two methods of training an LSTM-based language generator: the first is a traditional supervised approach with cross-entropy loss, and the second is a reinforcement learning-based approach that uses CIDEr [113] as the reward function. By utilising context-dependent GCNs that specifically account for the object, attribute, and relationship nodes, SASG is able to achieve competitive results when compared with similar models, as shown in Table 3.

SGAE (scene graph autoencoder) is another paper to make use of a multi-partite semantic graph. In the paper, Yang et al. [58] take a caption and convert it into a multi-partite textual semantic graph using a similar process to that of the SPICE metric [86] (detailed further in Table 2). The nodes of the graph are converted to word embeddings which are then converted into feature embeddings by way of a GCN, with each node type being given its own GCN with independent parameters. These feature embeddings are then combined with a dictionary to enable them to be re-encoded before they are used to generate a sentence. The dictionary weights are updated via back-propagating the cross-entropy loss from the sentence regeneration. By including a dictionary, the authors are able to learn inductive biases from the captions. This allows generated captions to go from ‘man on motorcycle’ to ‘man riding motorcycle’. When given an image, SGAE generates a multi-partite visual semantic graph, similar to [38, 80], using Faster-RCNN [78] and MotifNet [114]. These visual features are then combined with their word embeddings through a multi-modal GCN and then re-encoded using the previously learnt dictionary. These features are then used to generate the final sentence.

Yang et al. [116] take a multi-partite semantic graph and input it to a multi-head attention-based GNN. The MHA-GNN is based on the Transformer architecture in that a multi-head self-attention is computed between all the nodes of the graph. However, the output of the self-attention is masked by an adjacency matrix prior to the softmax. Doing so enables a self-attention mechanism that maintains the original semantic graph structure. Additionally, the model makes use of Mixture of Experts (MoE) decoding, a first for image captioning. Each node type (object, relationship, attribute) gets its own decoder, and the output is put through a soft router which computes the final output token.

Rather than utilising multiple graphs, Wang et al. [117] instead use a single fully connected spatial graph with an attention mechanism to learn the relationships between different regions. This graph is formed of nodes that represent the spatial information of regions within the image. Once formed, it is passed through a GGNN [100] to learn the weights associated with the edges. Once learnt, these edge weights correspond to the probability of a relationship existing between the two nodes.

The work of Yao et al. [87], following on from their GCN-LSTM [39], presents an image encoder that makes use of a novel hierarchy parsing (HIP) architecture. Rather than encoding the image in a traditional scene graph structure like most contemporary image captioning papers [39, 79, 89], Yao et al. [87] take the novel approach of using a tree structure (discussed in Sect. 2.3), exploiting the hierarchical nature of objects in images. Unlike their previous work, which focused on semantic and spatial relationships, this work is about the hierarchical structure within an image. This hierarchical relationship can be viewed as a combination of both semantic and spatial information, thereby merging the two graphs used previously. The feature vectors representing the vertices of the tree are then improved through the use of Tree-LSTM [118]. As trees are a special case of graphs, the authors also demonstrate that their previous GCN-LSTM [39] can be used to create enriched embeddings from the tree before decoding it with an LSTM. They demonstrate that the inclusion of hierarchy parsing improves scores on all benchmarks when compared with GCN-LSTM [39], which does not use hierarchical relationships.

The work of He et al. [60] builds on the idea of hierarchical spatial relationships proposed by Yao et al. [87]. However, rather than use a tree to represent these relationships, they use a graph with three relationship types: parent, neighbour, and child. They then propose a modification to the popular Transformer layer to better adapt it to the task of image processing. After detecting objects using Faster-RCNN [78], a hierarchical spatial relationship graph is constructed. Three adjacency matrices are then built from this graph to model the three relationship types (\(\Omega _p\), \(\Omega _n\), and \(\Omega _c\), respectively). The authors modify the Transformer layer so that rather than compute self-attention across the whole spatial graph, there is a sub-layer for each relationship type. Each sub-layer processes the query Q with its own key \(K_i\) and value \(V_i\) using the modified attention mechanism:

$$\begin{aligned} Attention(Q, K_i, V_i) = \Omega _i \odot Softmax \left( \frac{QK_i^T}{\sqrt{d}} \right) V_i \end{aligned}$$
(6)

where \(\odot \) is the Hadamard product and i refers to the relationship type \(i \in \{parent, neighbour, child\}\), which specifies the adjacency matrix (\(\Omega _i\)) used. Finally, \(\sqrt{d}\) is a scaling factor, with d being the dimension of \(K_i\). Using the Hadamard product essentially zeroes out the attention between regions whose relationship is not being processed by that sub-layer. The resulting encodings are decoded by an LSTM to produce captions.
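
A small sketch of Eq. 6 for a single relationship type is given below; the tensor sizes and the random adjacency are purely illustrative.

```python
import torch

def masked_relation_attention(Q, K_i, V_i, omega_i):
    """A sketch of Eq. 6: scaled dot-product attention whose weights are
    element-wise masked (Hadamard product) by the adjacency matrix Omega_i
    of one relationship type (parent, neighbour, or child)."""
    d = K_i.shape[-1]
    attn = torch.softmax(Q @ K_i.transpose(-2, -1) / d ** 0.5, dim=-1)  # (n, n)
    return (omega_i * attn) @ V_i   # zero out attention between unrelated regions

# Illustrative sizes: n = 36 regions, d = 512 features, random 'parent' adjacency
n, d = 36, 512
Q, K_p, V_p = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
omega_p = (torch.rand(n, n) > 0.8).float()
out = masked_relation_attention(Q, K_p, V_p, omega_p)
```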

Like [60], the \(\mathcal {M}^2\) meshed memory Transformer proposed by Cornia et al. [59] also makes use of the increasingly popular Transformer architecture [40]. Unlike other papers [39, 58, 60, 87], which impose some predefined structure on the extracted image features (spatial graph, semantic graph, etc.), \(\mathcal {M}^2\) uses stacks of self-attention layers across the set of all image regions. The standard keys and values of the Transformer are extended with the concatenation of learnable persistent memory vectors. These allow the architecture to encode a priori knowledge, such as the fact that ‘eggs’ and ‘toast’ make up the concept ‘breakfast’. When decoding the output of the encoder, a stack of self-attention layers is also used. Each decoder layer is connected via a gated cross-attention mechanism to each of the encoder layers, giving rise to the ‘meshed’ structure the paper is named after. The output of the decoder block is used to generate the final output caption.

The work of Herdade et al. [119] modifies the attention weight matrix in order to incorporate relative geometric relationships between detected objects. These geometric relationships are defined using a displacement vector that characterises the difference in geometry between two bounding boxes. The work allows the Transformer-based architecture to incorporate geometric relationships directly into the attention mechanism, a relationship not considered by other Transformer-based image captioning techniques such as [59].

The authors of [88] propose a novel similarity graph (referred to as a semantic graph in the paper) and topic graph. Built on dot product similarity, the graphs are produced without requiring graph extraction models such as MotifNet [114]. Rather, a set of vertices \(V = \{v_i \in \mathbb {R}^{d_{obj}}\}^{n_{obj}}_{i=1}\) is extracted as ResNet features from a Faster-RCNN object detector [78]. Edges in the adjacency matrix are then populated using the dot product between the feature vectors in V, with \(a_{ij} = \sigma (v_i^T \textbf{M}v_j)\), where \(\textbf{M}\) is a matrix of learnable weights and \(\sigma \) is a nonlinear activation function. Once both graphs have been constructed, a GCN is applied to each in order to enrich the nodes with local context. A graph self-attention mechanism is then applied to ensure nodes do not account only for their immediate neighbours. The enriched graphs are then decoded via an LSTM to generate captions.

Following [39], Dong et al. [89] supplement the representation of the input image with an image graph built from its K nearest neighbours in the training data or search space (described in Sect. 2.3), allowing information from visually similar images to inform caption generation.

5 Visual question answering

Visual question answering (VQA) is the task of answering a natural language question about an image. The task was originally proposed in [49] and has since attracted a number of graph-based approaches [84, 103].

5.1 VQA

Originally proposed in [49], VQA has developed beyond simple ‘yes’ or ‘no’ answers to richer natural language answers. A common thread of work is to leverage the multi-modal aspect of VQA and utilise both visual features from the input image and textual features from the question [84, 85, 103].

One of the first works in VQA to make use of GNNs was that of Teney et al. [84]. Their work is based on the clip art focused dataset [49]. Their model takes a visual scene graph as input alongside a question. The question is then parsed into a textual scene graph using the Stanford Dependency Parser [83]. These scene graphs are then processed independently using a GGNN [100] modified to incorporate an attention mechanism. The original feature vectors are then combined using an attention mechanism that reflects how relevant two nodes from the scene graphs are to one another.

Khademi [103] takes a multi-modal approach to VQA by using dense region captions alongside extracted visual features. Given a query and input image, the model first extracts visual regions using a Faster-RCNN object detector and generates a set of features using ResNet, encoding the bounding box information into these features. An off-the-shelf dense region captioning model is also used to create a set of captions and associated bounding boxes. The captions and bounding box information are encoded using a GRU. Each set of features is turned into a graph (visual and textual, respectively), with outgoing and incoming edges existing between features if the Euclidean distance between the centres of the normalised bounding boxes is less than \(\gamma = 0.5\). Both graphs are processed by a GGNN, with the updated features being used to update an external spatial memory unit, thus making the network a graph memory network (described in Sect. 3.4). After propagating the node features, the final state of the external spatial memory network is turned into a complete graph using each location as a node. This final graph is processed by a GGNN to produce the final answer. The multi-modal approach presented in this paper is shown to be highly effective, with the proposed MN-GMN architecture [103] performing favourably against comparable models (Table 4).
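
The distance-based edge construction described above can be sketched as follows; the function name is ours, and boxes are assumed to be normalised to [0, 1].

```python
import torch

def bbox_centre_edges(boxes, gamma=0.5):
    """Connects two nodes when the Euclidean distance between the centres of
    their normalised bounding boxes is below gamma = 0.5, as described for the
    graph construction in [103]. boxes: (n, 4) normalised (x1, y1, x2, y2)."""
    centres = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)  # (n, 2)
    dists = torch.cdist(centres, centres)                            # pairwise centre distances
    adj = (dists < gamma).float()
    adj.fill_diagonal_(0)          # no self-loops
    return adj
```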

MORN [85] is another work that focuses on capturing the complex multi-modal relationships between the question and image. Like many recent works in deep learning, it adopts the Transformer [40] architecture. The model is built from three main components. The first creates a visual graph of the image, starting from a fully connected graph of detected objects, with a GCN used to aggregate the visual features. The second creates a textual scene graph from the input question. The final component, a relational multi-modal Transformer, merges the two graphs and aligns their representations.

Sharma et al. [120] follow the Vision-Language multi-modal approach but diverge from the use of a textual semantic graph and instead opt to use word embeddings. The authors utilise a novel GGNN-based architecture that processes an undirected complete graph of nodes representing visual features. Edges are weighted with the probability that a relationship occurs between the two nodes they connect. In line with other VQA work [103], the question is capped at 14 words, with each word converted into a GloVe embedding [121]. Questions with fewer than 14 words are padded with zero vectors. A question embedding is then generated by applying a GRU to the word embeddings. An LSTM-based attention mechanism considers both the question vector and the visual representations making up the nodes of the scene graph; this module considers previously attended areas when exploring new visual features. Finally, an LSTM-based language generator is used to generate the final answer. Another work to forgo a textual scene graph is that of Zhang et al. [55], who make use of word vectors to embed information about the image into a semantic graph. Using a GNN, they create enriched feature vectors representing the nodes, edges, and an image feature vector representing the global state. They include the question in the image feature by averaging the word vectors, which enables the GNN to reason about the image. Whilst both [120] and [55] yield good results, by only using word- or sentence-level embeddings and not a textual scene graph, they fail to model relationships in the textual domain, removing the models' ability to reason in that domain alone.
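
The question encoding described above for [120] can be sketched as follows; the 300-d GloVe dimension, the dictionary-style `glove` lookup, and the hidden size are assumptions for illustration, and the GRU is instantiated inline purely for brevity.

```python
import torch
import torch.nn as nn

def embed_question(tokens, glove, max_len=14, hidden=512):
    """Sketch of the question encoding used in works such as [120]: the question
    is capped at 14 words, each converted to a GloVe embedding (zero vectors pad
    shorter questions), and a GRU produces a single question embedding.
    `glove` is assumed to map word -> 300-d tensor."""
    dim = 300
    vecs = [glove.get(w, torch.zeros(dim)) for w in tokens[:max_len]]
    vecs += [torch.zeros(dim)] * (max_len - len(vecs))   # zero-pad to 14 words
    seq = torch.stack(vecs).unsqueeze(0)                 # (1, 14, 300)
    gru = nn.GRU(input_size=dim, hidden_size=hidden, batch_first=True)
    _, h_n = gru(seq)
    return h_n.squeeze(0).squeeze(0)                     # (hidden,) question embedding
```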

Both Li et al. [122] and Nuthalapati et al. [123] take a different route to the established multi-modal approach and instead use different forms of visual information. Li et al. [122] take inspiration from [39] and make use of both semantic and spatial graphs to represent the image. In addition to these explicit graphs, they also introduce an implicit graph, i.e. a fully connected graph between the detected objects with edge weights set by a GAT. The relation-aware visual features are then combined with the question vector using multi-modal fusion. The fused output is then used to predict an answer via an MLP.

Nuthalapati et al. [123] use a dual scene graph approach, using both visual and semantic graphs. These graphs are merged into a single graph embedding using a novel GAT architecture [102] that is able to attend to edges as well as nodes. The graphs are enriched with negative entities that appear in the question but not the graph. Pruning then takes place to remove nodes and edges that are more than K hops away from features mentioned in the question. A decoder is then used to produce an answer to the input question.

Table 4 A table showing the model details and VQA [49] Test-Dev results of selected VQA models

5.2 Knowledge-/fact-based VQA

Knowledge- or fact-based VQA is the challenging task of making use of external knowledge given in knowledge graphs such as WikiData [70] to answer questions about an image. The major challenge of this task is to create a model that can make use of all three mediums (image, question, and fact) to generate an appropriate answer. The MUCKO [124] architecture, shown in Fig. 4 (reused with permission), is given as a representative example of models that approach FVQA.

Fig. 4 MUCKO architecture [124] (reused with permission). Best viewed in colour

In [125], the authors present a novel GCN-based architecture for FVQA. Alongside the question and answer sets, a knowledge base of facts is also included, \(KB = \{f_1, f_2,..., f_{|KB|}\}\). Each fact \(f = (x, r, y)\) is formed of a visual concept grounded in the image (x), an attribute or phrase (y), and a relation r linking the two. Relations are drawn from a predefined set of 13 different ways a concept and an attribute can be related. Their work first reduces the search space to the 100 facts most likely to contain the correct answer by using GloVe embeddings [121] of words in the question and facts, before further reducing it to the most relevant facts \(f_{rel}\). These most relevant facts are turned into a graph where all the visual concepts and attributes from \(f_{rel}\) form the nodes. An edge joins two nodes if they are related by a fact in \(f_{rel}\). A GCN is then used to ‘reason’ over the graph to predict the final answer. Using a message passing architecture, the authors update the feature representations of the nodes, which are then fed into an MLP that predicts a binary label corresponding to whether or not the entity contains the answer.

Zhu et al. [124] use a multi-modal graph approach to representing images, with a visual, a semantic, and a knowledge graph. After graph construction, GCNs are applied to each modality to create richer feature embeddings. These embeddings are then processed in a cross-modal manner. Visual–fact aggregation and semantic–fact aggregation operations produce complementary information, which is then used with a fact–fact convolutional layer. This final layer takes into account all three modalities and produces an answer that considers the global context. The authors continue their work in [77] by replacing the cross-modal mechanism with a novel GRUC (Graph-based Read, Update, and Control) mechanism. The GRUC operates as two parallel pipelines. One pipeline starts with a concept from the knowledge graph and recurrently incorporates knowledge from the visual graph. The other starts with the same knowledge graph concept but incorporates semantic knowledge. At the end of the recurrent operations, the outputs of the two pipelines are fused together with the question and the original fact node. This fused feature is then used to predict the final answer. The change made to the cross-modal attention mechanism yields significant improvements on the FVQA benchmark when compared with MUCKO [124].

Liu et al. [126] also adopt a multi-modal approach, but use only the semantic and knowledge modalities. They propose a dual process system for FVQA that is based on dual process theory from cognitive science [127]. Their approach utilises a BERT encoder to represent the input question and a Faster-RCNN [78]-based feature extractor to represent the image features. The first of the two systems, based on the Transformer architecture [40], joins these two representations into a single multi-modal representation. The second system then develops a semantic graph by turning dense region captions into textual scene graphs (using SPICE), as well as a knowledge graph generated using the question input. A message passing GNN is then used to identify the important nodes and aggregate information between them using an attention weighting. A joint representation for each knowledge graph node is then learned by combining the node with the whole semantic graph according to an attention weighting. This joint representation is then used to predict the final answer.

The GNN-VQA proposed in [128] makes use of a bidirectional GNN that fuses structured and unstructured multi-modal data through a ‘supernode’. After extracting a semantic graph, they use a pretrained sentence BERT model to calculate the similarity between the potential answers and the Visual Genome region descriptions. The top-10 region descriptions in terms of similarity are averaged and used to define a concept node which is connected to each visual node of the semantic graph. Concepts are then extracted from the ConceptNet KG [67] using the labels of detected objects in the image and concepts extracted from potential answers. A GAT-based GNN is then used to construct better node representations which are then used to select the correct answer.

Moving away from the multi-modal approach, SGEITL [129] builds a semantic graph of the image and then follows Yang et al. [54] in introducing skip edges to the graph, essentially making it a complete graph. This graph then goes through a multi-hop graph Transformer, which masks the attention between nodes based on their distance, ensuring that only nearby nodes are attended to. Through their work, the authors demonstrate that structural information is useful when approaching the complex VQA task.

With their TRiG model, Gao et al. move knowledge-based VQA away from graph-based reasoning and into the text space, converting the visual content into textual descriptions that a pretrained language model can reason over to generate an answer.

In the image retrieval setting, vector quantisation has been used to build graphs more efficiently: a quantiser \(q(\textbf{x})\) maps a feature vector to a codeword \(c_i\). These codewords make up a set of length K known as a code book; thus, \(q(x) \in \mathcal {C}=\{c_i; i \in \{0... (K-1)\}\}\). The learned code words are combined with image features to form a landmark graph, based on the similarity graph except that the graph also has nodes learned through the quantisation process. Once the landmark graph has been constructed, a GCN is used to propagate features with the objective of moving similar images closer together in the feature space. The use of vector quantisation allows the landmark graph to exist in a lower-dimensional space, reducing computation when determining which images from the graph to return as candidates.
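
A minimal sketch of the quantisation step \(q(\textbf{x})\) is given below; codebook learning (e.g. via k-means) is omitted and the sizes are illustrative.

```python
import torch

def quantise(x, codebook):
    """Minimal sketch of vector quantisation: q(x) maps each feature vector to
    its nearest codeword c_i from a codebook of K learned codewords."""
    dists = torch.cdist(x, codebook)   # (n, K) distances to each codeword
    idx = dists.argmin(dim=1)          # index i of the nearest codeword
    return codebook[idx], idx          # quantised vectors and their codes

# n = 100 image features of dimension 256, K = 16 codewords (illustrative sizes)
codebook = torch.randn(16, 256)
codes, idx = quantise(torch.randn(100, 256), codebook)
```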

The authors of [130] showed that reasoning in the text space improved performance over graph-based reasoning in VQA.

In summary, Vision-Language tasks such as those discussed in this paper are set to have a fruitful future, with many opportunities for various graph structures to be exploited.