1 Introduction

Recent years have seen an explosion of research into graph neural networks (GNNs), with a flurry of new architectures being presented in top-tier machine learning conferences and journals every year [28].

Early approaches to image captioning were retrieval-based, selecting a caption for an image from a pool of existing captions, and yielded good results without the need for deep learning. However, not all images may have an appropriate caption in the pool. If the available captions are generic, they will only describe some aspects of an image and may omit its most important feature. In contrast, template-based captioning [56] uses a predefined caption format and object detection to fill in the blanks. This approach produces consistent captions, but they can be unnatural and clearly machine-generated. Contemporary approaches to image captioning are based on deep learning models. Early work focused on a CNN encoder feeding an RNN-based decoder [57]; more recent deep learning approaches incorporate a wide variety of techniques, including GNNs [39, 58] and Transformers [59, 60]. In this survey, we concentrate on deep learning approaches to image captioning, with particular attention to graph-based methods. Deep learning approaches are typically trained on the COCO [47] or Flickr30k [48] datasets, which contain images each accompanied by five human-generated captions. Closely related to contemporary deep learning-based captioning are the tasks of paragraph captioning and video captioning. Paragraph image captioning is the challenge of generating a multi-sentence description of an image [61, 62], whilst video captioning focuses on describing videos. Readers interested in video captioning are directed to the recent survey [63].

Taxonomies of VQA are usually defined through the lens of the datasets used by the various sub-tasks, such as WikiData [70], IMDb, and [69]. Adjacent to the task of VQA and its sub-tasks is the field of Visual Grounding (sometimes known as Referring Expression). This is the task of identifying the salient regions of an image based on a natural language query. Although the task is closely aligned with VQA, it falls outside the scope of this paper, and we direct readers to [15].

Image retrieval spans multiple tasks, all of which make use of deep learning in contemporary approaches. We follow the taxonomy of Alexander et al. [45] and address the following sub-tasks: 1) text-based image retrieval, where images are returned based on a text query; 2) content-based image retrieval, where images are retrieved based on their similarity to an input image; 3) sketch-based retrieval, where images are retrieved based on their similarity to a sketch; 4) semantic-based retrieval, which returns images based on their perceptual content; and 5) annotation-based retrieval, where images are returned using meta-data annotations. The number of datasets used for image retrieval is vast, and the community has not solidified around a single dataset in the way image captioning has around COCO [47]. This makes accurate comparisons between systems difficult, as the level of challenge posed by each dataset varies. Whilst image retrieval-specific datasets exist [71], there are papers [72,73,74] that make use of image captioning datasets [47, 48], illustrating the wide range of datasets employed for image retrieval.

Understanding the inherent biases in datasets is incredibly important for deep learning researchers and practitioners. As models move beyond research benchmarks and into mainstream use, models trained on biased data will produce biased outputs and may contribute to the proliferation of harmful stereotypes. Within the scope of Vision-Language research, work has been done to discover negative biases in core datasets such as COCO, enabling researchers to mitigate these risks. The authors of [75] comprehensively demonstrated the gender, racial, and Western biases that exist in the COCO [47] dataset, finding that lighter-skinned individuals are \(7.5\times \) more common than darker-skinned individuals and that males are \(2\times \) more common than females. This leads to the concern that image captioning models may come to see the world as predominantly occupied by light-skinned males. Worryingly, [75] also finds racial slurs in the ground truth captions, raising concerns about captioning models producing captions containing this derogatory language.

Hirota et al. [76] continue the work of [75] but focus on VQA datasets. They find evidence of bias in Visual Genome [46] and OK-VQA [51], two widely used VQA datasets. The biases include a reflection of traditional gender stereotypes and a US-centric viewpoint on race and nationality.

In addition to the gender and racial biases in existing Vision-Language datasets, many of the datasets are limited in style. The vast majority contain real-world photographs (typically mined from Flickr), which limits models to understanding photographs only. This limitation is most significant in image captioning and image retrieval, where style is a significant component of the caption or retrieval query. For VQA, the impact is somewhat limited unless the question asked is specifically about the style of the image.

2.2 Fundamental graph theoretical concepts

Undirected graph. We define an undirected graph G to be a tuple of sets V and E, i.e. \(G = (V,E)\). The set V contains n vertices (sometimes referred to as nodes) that are connected by the edges in the set E, i.e. if \(v \in V\) and \(u \in V\) are connected by an edge then \(e_{v,u} \in E\). For an undirected graph, we have that \(e_{v,u} = e_{u,v}\).

Directed graph. A directed graph is a graph where the existence of \(e_{v,u}\) does not imply the existence of \(e_{u,v}\) as well. Let \(\textbf{A}\) be the \(n \times n\) binary adjacency matrix such that \(\textbf{A}_{v,u} = 1\) if \(e_{v,u} \in E\). Then it follows that \(\textbf{A}\) is asymmetric (symmetric) for directed (undirected) graphs. More generally, \(\textbf{A}\) can be a real-valued matrix, where the value of \(\textbf{A}_{v,u}\) can be interpreted as the strength of the connection between v and u.

Neighbourhood. The neighbourhood \(\mathcal {N}(v)\) of a vertex \(v \in V\) is the subset of nodes in V that are connected to v. The neighbour u can be either directly connected to v, i.e. \((v,u) \in E\), or indirectly connected by traversing r edges from v to u. Note that some definitions include v itself as part of the neighbourhood.
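
To make these definitions concrete, the short NumPy sketch below builds a binary or weighted adjacency matrix from an edge list and reads off the 1-hop neighbourhood of a vertex; the helper names are our own, not from any referenced work.

```python
import numpy as np

def adjacency_matrix(n, edges, directed=False, weights=None):
    """Build an n x n adjacency matrix from a list of (v, u) edge tuples.

    For an undirected graph the matrix is symmetrised; optional real-valued
    weights encode the strength of each connection."""
    A = np.zeros((n, n))
    for idx, (v, u) in enumerate(edges):
        w = 1.0 if weights is None else weights[idx]
        A[v, u] = w
        if not directed:
            A[u, v] = w  # e_{v,u} = e_{u,v} for undirected graphs
    return A

def neighbourhood(A, v):
    """Direct (1-hop) neighbours of vertex v, excluding v itself."""
    return [u for u in range(A.shape[0]) if u != v and A[v, u] != 0]

# Example: a small undirected graph with 4 vertices
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
print(neighbourhood(A, 1))  # [0, 2]
```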

Complete graph. A complete graph is one (directed or undirected) where for each vertex, there is an edge connecting it to every other vertex in the set V. A complete graph is therefore a graph with the maximum number of edges for a given number of nodes.

Multi-partite graph. A multi-partite graph (also known as a K-partite graph) is a graph where the nodes can be separated into K different sets. For scene understanding tasks, this allows for a graph representation where one set of nodes represents objects and another represents relationships between objects.

Multi-modal graph. A multi-modal graph is one with nodes that have features from different modalities. This approach is commonly used in VQA where the image and text modalities are mixed. Multi-modal graphs enable visual features to coexist in a graph with word embeddings.

2.3 Common graph types in 2D vision-language tasks

This section organises the various graph types used across all three tasks discussed in the survey. Some graphs, such as the semantic and spatial graphs, are used across all tasks [39, 55, 73], whilst others are more domain specific, like the knowledge graph [53, 77]. Figure 2 shows a sample image from the COCO dataset [47] together with various types of graphs that can be used to describe it. This section, alongside the figure, is organised so that graphs representing a single image and graphs representing portions of the dataset are grouped together.

Semantic graph, multi-partite semantic graph, and textual semantic graph. Sometimes referred to as a scene graph, a semantic graph (shown in Fig. 2b) is one that encapsulates the semantic relationships between visual objects within a scene. Across the literature, the terms ‘semantic graph’ and ‘scene graph’ are used somewhat interchangeably, depending on the paper. However, in this survey we use the term ‘semantic graph’ because there are many ways to describe a visual scene as a graph, whereas the ‘semantic graph’ label is more precise about what the graph represents. Semantic graphs come in different flavours. One approach is to define a directed graph with nodes representing visual objects extracted by an object detector such as Faster-RCNN [78] and edges representing semantic relationships between them. This is the approach of Yao et al. [39], where, using a dataset such as Visual Genome [46], a model predicts the semantic relationships that form the edges of the graph. Alternatively, the semantic graph can be seen as a multi-partite graph [58, 79,80,81] (shown in Fig. 2c), where attribute nodes describe the object nodes they are linked to. These works also change the way relationships are represented by using nodes rather than edge features. This yields a semantic graph with three node types: visual object, object attribute, and inter-object relationship. This definition follows that of the ‘scene graph’ defined by Johnson et al. [38]. Finally, another form of semantic graph exists, the textual semantic graph [58, 82] (shown in Fig. 2d). Unlike visual semantic graphs, textual ones are not generated from the image itself but rather from its caption. Specifically, the caption is parsed with the Stanford Dependency Parser [83], a widely used [84, 85] probabilistic sentence parser. Given a caption, the parser returns its grammatical structure, identifying components such as nouns, verbs, and adjectives and marking the relationships between them. This is then modified from a tree into a graph, following the techniques outlined in [86].

Spatial graph. Yao et al. [39] define a spatial graph (Fig. 2e) as one representing the spatial relationships between objects. Visual objects detected by an object detector form nodes, and the edges between the nodes represent one of 11 predefined spatial relationships that may occur between the two objects. These include inside (labelled ‘1’), cover (labelled ‘2’), overlap (labelled ‘3’), and eight positional relationships (labelled ‘4’–‘11’) based on the angle between the centroids of the two objects. These graphs are directed but not always complete, as there are cases where two objects have a weak spatial relationship and are therefore not connected by an edge in the spatial graph. Guo et al. [80] define a graph of a similar nature known as a geometry graph. It is defined as an undirected graph that encodes relative spatial positions between objects whose overlap and relative distance meet certain thresholds.
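
As an illustration of how such spatial edge labels might be computed, the following sketch assigns a label to an ordered pair of bounding boxes; the containment and IoU thresholds and the mapping of angles to the labels ‘4’–‘11’ are our own assumptions rather than the exact rules of [39].

```python
import math

def spatial_relationship(box_a, box_b, iou_threshold=0.5):
    """Illustrative sketch of spatial edge labelling in the spirit of Yao et al.'s
    spatial graph: labels '1'-'3' for inside/cover/overlap and '4'-'11' for eight
    directional bins based on the angle between box centroids. Thresholds and
    label assignments are assumptions. Boxes are (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if bx1 <= ax1 and by1 <= ay1 and bx2 >= ax2 and by2 >= ay2:
        return "1"  # 'inside': box_a lies entirely within box_b
    if ax1 <= bx1 and ay1 <= by1 and ax2 >= bx2 and ay2 >= by2:
        return "2"  # 'cover': box_a entirely contains box_b
    inter = max(0, min(ax2, bx2) - max(ax1, bx1)) * max(0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    if inter / union > iou_threshold:
        return "3"  # 'overlap'
    # Otherwise bin the centroid-to-centroid angle into one of eight 45-degree sectors.
    ca = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    cb = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    angle = math.degrees(math.atan2(cb[1] - ca[1], cb[0] - ca[0])) % 360
    return str(4 + int(angle // 45))  # positional labels '4'-'11'
```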

Hierarchical spatial (Tree). These graphs build on the spatial graph, but the relationships between nodes focus on the hierarchical nature of the spatial relationships between the detected objects within an image. Yao et al. [87] propose to use a tree (i.e. a graph where each pair of nodes is connected by a single path) to define a hierarchical image representation. An image (\(\mathcal {I}\)) is first divided into regions using Faster-RCNN [78] (\(\mathcal {R} = \{r_i\}^K_{i=1}\)), with each region being further divided into instance segmentations (\(\mathcal {M} = \{m_i\}^K_{i=1}\)). This gives a three-layer tree structure (\(\mathcal {T} = (\mathcal {I}, \mathcal {R}, \mathcal {M}, \mathcal {E}_{tree})\), where \(\mathcal {E}_{tree}\) is the set of connecting edges) to represent the image, as shown in Fig. 2f. He et al. [60] use a hierarchical spatial graph, with edges representing ‘parent’, ‘child’, and ‘neighbour’ relationships depending on the intersection over union of the bounding boxes.

Similarity graph. The similarity graph (Fig. 2g) proposed by Kan et al. [88] (referred to as a semantic graph by the authors) is generated by computing the dot product between pairs of visual features extracted by Faster-RCNN [78]. The dot products form the values of an adjacency matrix \(\textbf{A}\), as the operation captures the similarity between two vectors: the higher the dot product, the closer the two vectors. Faster-RCNN extracts a set of n visual features, where each feature x(v) is associated with a node v and the value of the edge between two nodes v and u is given by \(\textbf{A}_{u,v} = \sigma \left( x(v)^T \textbf{M} x(u)\right) \), where \(\sigma (\cdot )\) is a nonlinear function and \(\textbf{M}\) is a learnt weight matrix. The authors of [88] suggest that generating the graph this way allows relationships between objects to be discovered in a data-driven manner, rather than relying on a model trained on a dataset such as Visual Genome [46].
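
A minimal PyTorch sketch of this construction is shown below; the feature dimension, the use of a sigmoid for \(\sigma \), and the module name are illustrative assumptions rather than details taken from [88].

```python
import torch
import torch.nn as nn

class SimilarityGraph(nn.Module):
    """Builds the learned adjacency A_{u,v} = sigma(x(v)^T M x(u)) over a set
    of region features, in the spirit of the similarity graph of Kan et al. [88].
    Dimensions and the choice of sigmoid are illustrative assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.M = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.M)

    def forward(self, x):             # x: (n, dim) region features from an object detector
        scores = x @ self.M @ x.t()   # (n, n) bilinear similarities
        return torch.sigmoid(scores)  # nonlinearity sigma gives edge weights in (0, 1)

# n = 36 region features of dimension 2048 (typical Faster-RCNN output size)
features = torch.randn(36, 2048)
A = SimilarityGraph(2048)(features)   # (36, 36) learned adjacency matrix
```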

Image graph/K-nearest neighbour graph. In their 2021 image captioning work, Dong et al. [89] construct an image graph by mapping images into a latent feature space, averaging the object vectors output by feeding each image into Faster-RCNN [78]. The K closest images from the training data or search space in terms of \(l_2\) distance are then turned into an undirected complete graph, shown in Fig. 2h. A similar approach is used by Liu et al. [90] with their K-nearest neighbour graph.
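
The sketch below illustrates this kind of construction under our own simplifying assumptions (mean-pooled 2048-d detector features, a fully connected graph over the query image and its K neighbours).

```python
import torch

def knn_image_graph(query_feat, gallery_feats, k=5):
    """Builds an undirected, complete graph over the query image and its K
    nearest images in l2 distance, roughly following the image graph of [89].
    Each image is represented by the mean of its object feature vectors."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (N,)
    knn_idx = torch.topk(dists, k, largest=False).indices                   # K closest images
    nodes = torch.cat([query_feat.unsqueeze(0), gallery_feats[knn_idx]], dim=0)
    n = nodes.shape[0]
    adj = torch.ones(n, n) - torch.eye(n)  # complete graph, no self-loops
    return nodes, adj

# Mean-pooled object features: one 2048-d vector per image (illustrative sizes)
query = torch.randn(2048)
gallery = torch.randn(1000, 2048)
nodes, adj = knn_image_graph(query, gallery, k=5)
```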

Topic graph. Proposed by Kan et al. [88], the topic graph is an undirected graph of nodes representing topics extracted by GPU-DMM [91]. Topics are latent features representing shared knowledge across the entire caption set. Modelling them as a graph, as shown in Fig. 2i, with edges computed by taking the dot product of the two nodes, allows the modelling of knowledge represented in the captions.

Region adjacency graph. A region adjacency graph (RAG) is a graph made up of nodes representing homogeneous segments of the image, with edges representing the connection of adjacent regions. There are different approaches to defining the regions, but patches or superpixels are commonly used. Patches, small equal divisions of the image, are used by Sui et al. [92] whilst superpixels, an unsupervised segmentation approach for clustering nearby pixels, are used by [93].

Knowledge graph. A knowledge graph, or fact graph, is a graph-based representation of information. Whilst there is no agreed structure for these graphs [94], they typically take the form of triplets. They are used in a wide variety of tasks to provide the information needed to ‘reason’. Hence, knowledge graphs enable the fact-based VQA (FVQA) task.

Fig. 2 A visual comparison of the various graph types used across vision-language tasks. Best viewed in colour

3 An overview of graph neural networks

Over the past few years, a large number of GNN architectures have been introduced in the literature. Wu et al. [95] proposed a taxonomy containing four distinct groups: recurrent GNNs, convolutional GNNs, autoencoder GNNs, and spatial–temporal GNNs. The applications discussed in this paper mostly utilise convolutional GNNs; for a comprehensive overview of other architectures, readers are directed to [95]. GNNs, especially traditional architectures such as the graph convolutional network, have a deep grounding in relational inductive biases [41]. They are built on the assumption of homophily, i.e. that connected nodes are similar. There is an increasing body of work looking into addressing some of the bottlenecks that GNNs may suffer from. Novel training strategies such as [96] have been shown to reduce GPU memory usage, whilst approaches such as [97] reduce the difference in performance when dealing with homophilic or heterophilic graphs.

3.1 Graph convolutional networks

One common convolutional GNN architecture is the message passing neural network (MPNN) proposed by Gilmer et al. Although this architecture has been shown to be limited [98], it forms a good abstraction of GNNs.

Gilmer et al. describe MPNNs as being composed of a message function, an update function, and a readout function. These functions vary depending on the application of the network, but are learnable, differentiable, and permutation invariant. The message and update functions run for a number of time steps T, passing messages between connected nodes of the graph. The messages are used to update the hidden feature vectors of the nodes, which in turn are used by the readout function.

The messages are defined as

$$\begin{aligned} \textbf{m}^{(t + 1)}_v = \sum _{u \in \mathcal {N}(v)} M_t(\textbf{h}^{(t)}_v, \textbf{h}^{(t)}_u, \textbf{e}_{v,u}) \,, \end{aligned}$$
(1)

where a message for a node at the next time step \(\textbf{m}^{(t + 1)}_v\) is given by combining its current hidden state \(\textbf{h}^{(t)}_v\) with that of its neighbour \(\textbf{h}^{(t)}_u\) and any edge feature \(\textbf{e}_{v,u}\) in a multi-layer perceptron (MLP) \(M_t(\cdot )\). Given that a message is an aggregation over all the connected nodes, the summation acts over the nodes \(u \in \mathcal {N}(v)\), i.e. the neighbourhood of v.

These messages are then used to update the hidden vectors by combining the node's current state with the message in an MLP \(U_t\).

$$\begin{aligned} \textbf{h}^{(t+1)}_v = U_t(\textbf{h}^t_v, \textbf{m}^{(t+1)}_v)\end{aligned}$$
(2)

Once the message passing phase has run for T time steps, a readout phase is then conducted using a readout function, \(R(\cdot )\). This is defined as an MLP that considers the updated feature vectors of nodes (\(\textbf{h}^T_v\)) on the whole graph (\(v \in G\)) to produce a prediction and is defined as:

$$\begin{aligned} \hat{y} = R(\{\textbf{h}^T_v | v \in G\})\end{aligned}$$
(3)
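
The sketch below ties Eqs. 1–3 together in PyTorch for a graph stored as a dense adjacency matrix; the MLP sizes, the sum-then-linear readout, and the class name are our own illustrative choices rather than a specific published implementation.

```python
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    """A minimal sketch of Eqs. 1-3: MLP message, update, and readout functions
    over a dense adjacency matrix. Hidden sizes are illustrative."""

    def __init__(self, dim, edge_dim):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * dim + edge_dim, dim), nn.ReLU())  # M_t
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())              # U_t
        self.readout = nn.Linear(dim, 1)                                             # R

    def forward(self, h, e, adj, T=3):
        # h: (n, dim) node states, e: (n, n, edge_dim) edge features, adj: (n, n) binary
        for _ in range(T):
            n = h.shape[0]
            hv = h.unsqueeze(1).expand(n, n, -1)                    # receiver state h_v
            hu = h.unsqueeze(0).expand(n, n, -1)                    # neighbour state h_u
            pair = self.message(torch.cat([hv, hu, e], dim=-1))     # per-edge messages
            m = (adj.unsqueeze(-1) * pair).sum(dim=1)               # Eq. 1: sum over N(v)
            h = self.update(torch.cat([h, m], dim=-1))              # Eq. 2
        return self.readout(h.sum(dim=0))                           # Eq. 3: one simple choice of R
```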

In order to make the GCN architecture scale to large graphs, the GraphSAGE [99] architecture changes the message function. Rather than taking messages from the entire neighbourhood of a node, a random sample is used. This reduces the number of messages that require processing, resulting in an architecture that works well on large graphs.
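
A minimal sketch of the kind of neighbour sampling GraphSAGE performs is given below; the helper is our own, not the library's API, and the sampled indices would stand in for the full neighbourhood \(\mathcal {N}(v)\) in the message sum.

```python
import torch

def sample_neighbourhood(adj, v, num_samples=10, generator=None):
    """GraphSAGE-style neighbour sampling: rather than aggregating messages
    from every neighbour of v, draw a fixed-size random sample. Returns the
    indices of the sampled neighbours (illustrative sketch)."""
    neighbours = torch.nonzero(adj[v], as_tuple=False).flatten()
    if neighbours.numel() <= num_samples:
        return neighbours
    perm = torch.randperm(neighbours.numel(), generator=generator)[:num_samples]
    return neighbours[perm]
```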

3.2 Gated graph neural networks

The core idea behind the gated graph neural network (GGNN) [100] is to replace the update function from the message passing architecture (Eq. 2) with a gated recurrent unit (GRU) [101]. The GRU is a recurrent neural network with update and reset gates that control which data can flow through the network (and be retained) and which data cannot (and is therefore forgotten).

$$\begin{aligned} \textbf{h}^{(t+1)}_v = GRU\left( \textbf{h}^{(t)}_v, \sum _{u \in \mathcal {N}(v)}\textbf{W}\textbf{h}^{(t)}_u\right) \,. \end{aligned}$$
(4)

where \(\textbf{h}^t_v\) is the hidden feature of node v at time t, \(\textbf{W}\) is a learnt weight matrix, and \(u \in \mathcal {N}(v)\) is the subset of nodes in the graph connected to node v.

The GGNN also replaces the message function from Eq. 1 with a learnable weight matrix. Using the GRU alongside back-propagation through time enables the GGNN to operate on sequential data. However, due to the recurrent nature of the architecture, it can become infeasible in terms of memory to run the GGNN on large graphs.
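
A minimal sketch of the update in Eq. 4, using PyTorch's GRUCell and a dense adjacency matrix, is given below; the layer name and the number of propagation steps are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GGNNLayer(nn.Module):
    """A minimal sketch of the GGNN update in Eq. 4: neighbour states are
    projected by a learnt weight matrix W, summed, and fed to a GRU cell."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(input_size=dim, hidden_size=dim)

    def forward(self, h, adj, steps=3):
        # h: (n, dim) node states, adj: (n, n) adjacency (adj[v, u] = 1 if u in N(v))
        for _ in range(steps):
            msg = adj @ self.W(h)   # sum over N(v) of W h_u
            h = self.gru(msg, h)    # GRU update of h_v given the aggregated message
        return h
```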

3.3 Graph attention networks

Following on from the multi-head attention mechanism of the popular Transformer architecture [40], graph attention networks (GATs) [102] extend the common GCN with an attention mechanism. Using an attention function, typically modelled by an MLP, the architecture calculates an attention weighting between two nodes. This process is repeated K times using K attention heads in parallel, and the resulting head outputs are concatenated or averaged to give the final node representations.

The self-attention is computed by a function \(a(h^t_v, h^t_u)\) (typically an MLP) that attends to the hidden representation of a node (\(h^t_v\)) and one of its neighbours (\(h^t_u\)). Once every node pairing in the graph has had its attention computed, the scores are passed through a softmax function to give a normalised attention coefficient (\(\alpha _{v,u}\)). This process is then extended to multi-head attention by repeating it across K different attention heads, each with different initialisation weights. The final node representation is obtained by concatenating (represented as \(\Vert \)) or averaging the K attention heads.

$$\begin{aligned} \textbf{h}_v^{(t+1)} = \Bigg \Vert ^K_{k=1} \sigma \left( \sum _{u \in \mathcal {N}(v)} \alpha _{v,u}^{(k)} \textbf{W}^{(k)} \textbf{h}_u\right) \end{aligned}$$
(5)

where \(\sigma \) is a nonlinear activation function such as ReLU and \(\textbf{W}\) is a learnable weight matrix.
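
As a concrete reference point, a single-layer PyTorch sketch of Eq. 5 is given below. The additive form of the attention function follows the original GAT paper, whilst the hidden sizes, number of heads, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """A minimal single-layer sketch of Eq. 5: per-head attention scores are
    masked by the adjacency, softmax-normalised into alpha_{v,u}, and the K
    head outputs are concatenated."""

    def __init__(self, dim, out_dim, heads=4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(dim, heads * out_dim, bias=False)     # W^(k) for all heads
        self.a = nn.Parameter(torch.empty(heads, 2 * out_dim))   # additive attention vectors
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        n = h.shape[0]
        z = self.W(h).view(n, self.heads, self.out_dim)                                # (n, K, d)
        scores = torch.einsum('vkd,kd->vk', z, self.a[:, :self.out_dim]).unsqueeze(1) \
               + torch.einsum('ukd,kd->uk', z, self.a[:, self.out_dim:]).unsqueeze(0)  # (n, n, K)
        scores = F.leaky_relu(scores).masked_fill(adj.unsqueeze(-1) == 0, float('-inf'))
        alpha = torch.softmax(scores, dim=1)                       # normalise over neighbours u
        out = torch.einsum('vuk,ukd->vkd', alpha, z)               # weighted sum of W^(k) h_u
        return F.elu(out.reshape(n, self.heads * self.out_dim))    # concatenate the K heads
```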

3.4 Graph memory networks

Recent years have seen the development of graph memory networks, which can conceptually be thought of as models with an internal and an external memory. When multiple graphs overlap the same spatial information, as in [103], the use of some form of external memory allows node updates to be aggregated as the graphs undergo message passing. This essentially allows features from multiple graphs to be combined in a way that goes beyond a simple pooling operation. In the case of Khademi [103], two graphs are constructed over the same image but may have different nodes. These graphs are updated using a GGNN. An external spatial memory is constructed to aggregate information from across the graphs as they are updated, using a neural network with an attention mechanism. The final state of the spatial memory is used to perform the final task.

3.5 Modern graph neural network architectures

In recent years, the limits of message passing GNNs have become increasingly evident, from their tendency to oversmooth the input features as the depth of the network increases [104] to their unsatisfactory performance in heterophilic settings [105], i.e. when neighbouring nodes in the input graphs are dissimilar. Furthermore, the expressive power of GNNs based on the message passing mechanism has been shown to be bounded by that of the well-known Weisfeiler–Lehman isomorphism test [98], meaning that there are inherent limits to their ability to generate different representations for structurally different input graphs.

Motivated by the desire to overcome these issues, researchers have now started looking at alternative models that move away from standard message passing architectures. Efforts in this direction include, among many others, higher-order message passing architectures [106], cell complex networks [107], and networks based on diffusion processes [2, 105, 108]. To the best of our knowledge, the application of these architectures to the 2D image understanding tasks discussed in this paper has not been explored yet. As such, we refer the readers to the referenced papers for detailed information on the respective architectures.

4 Image captioning

Image captioning is the challenging task of producing a natural language description of an image. Beyond being an interesting technical challenge, it presents an opportunity to develop accessibility technologies for severely sight impaired (formerly ‘blind’) and sight impaired (formerly ‘visually impaired’Footnote 2) users. Additionally, it has applications in problems ranging from image indexing [109] to surveillance [88]. There are three forms of image captioning techniques: 1) retrieval-based captioning, where a caption is retrieved from a set of existing captions; 2) template-based captioning, where a pre-existing template is filled in using information extracted from the image; and 3) deep learning-based image captioning, where a neural network is tasked with generating a caption from an input image. We propose to refine this taxonomy to differentiate between GNN-based approaches and more traditional deep learning-powered image captioning. The following section details the GNN-based approaches to image captioning, of which there have been a number in recent years. Figure 3 illustrates the structure of a generic GNN-based image captioning architecture.

Fig. 3 An abstract overview of GNN-based image captioning architectures discussed in this section

Most architectures extract image features and use them to construct at least one graph to represent the image. The edges (in the case of [39, 88]) or nodes (in the case of [80]) that encode relationships map clearly to prepositions, and nodes representing attributes map to adjectives. This strong relationship between the graph structure generated by the encoder and the final sentence output by the decoder further supports the image graph–sentence architecture used by many image captioning systems.

Zhou et al. [81] use an LSTM alongside a Faster-RCNN [78]-based image feature extractor, with the addition of a visual self-attention mechanism. The authors make use of a multi-partite semantic graph, following the style of [38, 80]. Specifically, they propose to use three GCNs to create context-aware feature vectors for each of the object, attribute, and relationship nodes. The resulting context-aware nodes undergo fusion with the self-attention maps, enabling the model to control the granularity of captions. Finally, the authors test two methods of training an LSTM-based language generator: the first is a traditional supervised approach with cross-entropy loss, and the second is a reinforcement learning-based approach that uses CIDEr [113] as the reward function. By utilising context-dependent GCNs that specifically account for the object, attribute, and relationship nodes, SASG is able to achieve competitive results when compared with similar models, as shown in Table 3.

SGAE (scene graph autoencoder) is another paper to make use of a multi-partite semantic graph. In the paper, Yang et al. [58] take a caption and convert it into a multi-partite textual semantic graph using a similar process to that of the SPICE metric [86] (detailed further in Table 2). The nodes of the graph are converted to word embeddings which are then converted into feature embeddings by way of a GCN, with each node type being given its own GCN with independent parameters. These feature embeddings are then combined with a dictionary to enable them to be re-encoded before they are used to generate a sentence. The dictionary weights are updated via back-propagating the cross-entropy loss from the sentence regeneration. By including a dictionary, the authors are able to learn inductive biases from the captions. This allows generated captions to go from ‘man on motorcycle’ to ‘man riding motorcycle’. When given an image, SGAE generates a multi-partite visual semantic graph, similar to [38, 80], using Faster-RCNN [78] and MotifNet [114]. These visual features are then combined with their word embeddings through a multi-modal GCN and then re-encoded using the previously learnt dictionary. These features are then used to generate the final sentence.

Yang et al. [116] take a multi-partite semantic graph and input it to a multi-head attention-based GNN. The MHA-GNN is based on the Transformer architecture in that a multi-head self-attention is computed between all the nodes of the graph. However, the output of the self-attention is masked by an adjacency matrix prior to the softmax. Doing so enables a self-attention mechanism that maintains the original semantic graph structure. Additionally, the model makes use of Mixture of Experts (MoE) decoding, a first for image captioning. Each node type (object, relationship, attribute) gets its own decoder, and the output is put through a soft router which computes the final output token.

Rather than utilising multiple graphs, Wang et al. [117] instead use a single fully connected spatial graph with an attention mechanism to learn the relationships between different regions. This graph is formed of nodes that represent the spatial information of regions within the image. Once formed, it is passed through a GGNN [100] to learn the weights associated with the edges. Once learnt, these edge weights correspond to the probability of a relationship existing between the two nodes.

The work of Yao et al. [87], following on from their GCN-LSTM [39], presents an image encoder that makes use of a novel hierarchy parsing (HIP) architecture. Rather than encoding the image in a traditional scene graph structure like most contemporary image captioning papers [39, 79, 89], Yao et al. [87] take the novel approach of using a tree structure (discussed in Sect. 2.3), exploiting the hierarchical nature of objects in images. Unlike their previous work, which focused on semantic and spatial relationships, this work is about the hierarchical structure within an image. This hierarchical relationship can be viewed as a combination of both semantic and spatial information, thereby merging the two graphs used previously. The feature vectors representing the vertices of the tree are then improved through the use of Tree-LSTM [118]. As trees are a special case of graphs, the authors also demonstrate that their previous GCN-LSTM [39] can be used to create enriched embeddings from the tree before decoding it with an LSTM. They demonstrate that the inclusion of hierarchy parsing improves scores on all benchmarks when compared with GCN-LSTM [39], which does not use hierarchical relationships.

The work of He et al. [60] builds on the idea of hierarchical spatial relationships proposed by Yao et al. [87]. However, rather than use a tree to represent these relationships, they use a graph with three relationship types: parent, neighbour, and child. They then propose a modification to the popular Transformer layer to better adapt it to the task of image processing. After detecting objects using Faster-RCNN [78], a hierarchical spatial relationship graph is constructed. Three adjacency matrices are then built from this graph to model the three relationship types (\(\Omega _p\), \(\Omega _n\), and \(\Omega _c\), respectively). The authors modify the Transformer layer so that rather than compute self-attention across the whole spatial graph, there is a sub-layer for each relationship type. Each sub-layer processes the query Q with its own key \(K_i\) and value \(V_i\) using the modified attention mechanism:

$$\begin{aligned} Attention(Q, K_i, V_i) = \Omega _i \odot Softmax \left( \frac{QK_i^T}{\sqrt{d}} \right) V_i \end{aligned}$$
(6)

where \(\odot \) is the Hadamard product and i refers to the relationship type \(i \in \{parent, neighbour, child\}\), which specifies the adjacency matrix (\(\Omega _i\)) used. Finally, \(\sqrt{d}\) is a scaling factor, with d being the dimension of \(K_i\). Using the Hadamard product essentially zeroes out the attention between regions whose relationship is not being processed by that sub-layer. The resulting encodings are decoded by an LSTM to produce captions.
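
A small sketch of Eq. 6 for a single relationship type is given below; the tensor sizes and the random adjacency are purely illustrative.

```python
import torch

def masked_relation_attention(Q, K_i, V_i, omega_i):
    """A sketch of Eq. 6: scaled dot-product attention whose weights are
    element-wise masked (Hadamard product) by the adjacency matrix Omega_i
    of one relationship type (parent, neighbour, or child)."""
    d = K_i.shape[-1]
    attn = torch.softmax(Q @ K_i.transpose(-2, -1) / d ** 0.5, dim=-1)  # (n, n)
    return (omega_i * attn) @ V_i   # zero out attention between unrelated regions

# Illustrative sizes: n = 36 regions, d = 512 features, random 'parent' adjacency
n, d = 36, 512
Q, K_p, V_p = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
omega_p = (torch.rand(n, n) > 0.8).float()
out = masked_relation_attention(Q, K_p, V_p, omega_p)
```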

Like [60], the \(\mathcal {M}^2\) meshed memory Transformer proposed by Cornia et al. [59] also makes use of the increasingly popular Transformer architecture [40]. Unlike other papers [39, 58, 60, 87], which impose some predefined structure on the extracted image features (spatial graph, semantic graph, etc.), \(\mathcal {M}^2\) uses stacks of self-attention layers across the set of all image regions. The standard keys and values of the Transformer are extended with the concatenation of learnable persistent memory vectors. These allow the architecture to encode a priori knowledge, such as the fact that ‘eggs’ and ‘toast’ make up the concept ‘breakfast’. When decoding the output of the encoder, a stack of self-attention layers is also used. Each decoder layer is connected via a gated cross-attention mechanism to each of the encoder layers, giving rise to the ‘meshed’ structure the paper is named after. The output of the decoder block is used to generate the final output caption.

The work of Herdade et al. [119] modifies the attention weight matrix in order to incorporate relative geometric relationships between detected objects. These geometric relationships are defined using a displacement vector that characterises the difference in geometry between two bounding boxes. The work allows the Transformer-based architecture to incorporate geometric relationships directly into the attention mechanism, a relationship not considered by other Transformer-based image captioning techniques such as [59].

The authors of [88] propose a novel similarity graph (referred to as a semantic graph in the paper) and topic graph. Built on dot product similarity, the graphs are produced without requiring graph extraction models such as MotifNet [114]. Rather, a set of vertices \(V = \{v_i \in \mathbb {R}^{d_{obj}}\}^{n_{obj}}_{i=1}\) is extracted as ResNet features from a Faster-RCNN object detector [78]. Edges in the adjacency matrix are then populated using the dot product between the feature vectors in V, with \(a_{ij} = \sigma (v_i^T \textbf{M}v_j)\), where \(\textbf{M}\) is a matrix of learnable weights and \(\sigma \) is a nonlinear activation function. Once both graphs have been constructed, a GCN is applied to each in order to enrich the nodes with local context. A graph self-attention mechanism is then applied to ensure nodes do not account only for their immediate neighbours. The enriched graphs are then decoded via an LSTM to generate captions.

Following [39], Dong et al. [89] supplement the representation of the input image with an image graph built from its K nearest neighbours in the training data or search space (described in Sect. 2.3), allowing information from visually similar images to inform caption generation.

5 Visual question answering

Visual question answering (VQA) is the task of answering a natural language question about an image. The task was originally proposed in [49] and has since attracted a number of graph-based approaches [84, 103].

5.1 VQA

Originally proposed in [49], VQA has developed beyond simple ‘yes’ or ‘no’ answers to richer natural language answers. A common thread of work is to leverage the multi-modal aspect of VQA and utilise both visual features from the input image and textual features from the question [84, 85, 103].

One of the first works in VQA to make use of GNNs was that of Teney et al. [84]. Their work is based on the clip art focused dataset [49]. Their model takes a visual scene graph as input alongside a question. The question is then parsed into a textual scene graph using the Stanford Dependency Parser [83]. These scene graphs are then processed independently using a GGNN [100] modified to incorporate an attention mechanism. The original feature vectors are then combined using an attention mechanism that reflects how relevant two nodes from the scene graphs are to one another.

Khademi [103] takes a multi-modal approach to VQA by using dense region captions alongside extracted visual features. Given a query and input image, the model first extracts visual regions using a Faster-RCNN object detector and generates a set of features using ResNet, encoding the bounding box information into these features. An off-the-shelf dense region captioning model is also used to create a set of captions and associated bounding boxes. The captions and bounding box information are encoded using a GRU. Each set of features is turned into a graph (visual and textual, respectively), with outgoing and incoming edges existing between features if the Euclidean distance between the centres of the normalised bounding boxes is less than \(\gamma = 0.5\). Both graphs are processed by a GGNN, with the updated features being used to update an external spatial memory unit, thus making the network a graph memory network (described in Sect. 3.4). After propagating the node features, the final state of the external spatial memory network is turned into a complete graph using each location as a node. This final graph is processed by a GGNN to produce the final answer. The multi-modal approach presented in this paper is shown to be highly effective, with the proposed MN-GMN architecture [103] performing favourably against comparable models (Table 4).
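
The distance-based edge construction described above can be sketched as follows; the function name is ours, and boxes are assumed to be normalised to [0, 1].

```python
import torch

def bbox_centre_edges(boxes, gamma=0.5):
    """Connects two nodes when the Euclidean distance between the centres of
    their normalised bounding boxes is below gamma = 0.5, as described for the
    graph construction in [103]. boxes: (n, 4) normalised (x1, y1, x2, y2)."""
    centres = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)  # (n, 2)
    dists = torch.cdist(centres, centres)                            # pairwise centre distances
    adj = (dists < gamma).float()
    adj.fill_diagonal_(0)          # no self-loops
    return adj
```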

MORN [85] is another work that focuses on capturing the complex multi-modal relationships between the question and image. Like many recent works in deep learning, it adopts the Transformer [40] architecture. The model is built from three main components. The first creates a visual graph of the image, starting from a fully connected graph of detected objects, with a GCN used to aggregate the visual features. The second creates a textual scene graph from the input question. The final component, a relational multi-modal Transformer, merges the two graphs and aligns their representations.

Sharma et al. [120] follow the Vision-Language multi-modal approach but diverge from the use of a textual semantic graph and instead opt to use word embeddings. The authors utilise a novel GGNN-based architecture that processes an undirected complete graph of nodes representing visual features. Edges are weighted with the probability that a relationship occurs between the two nodes they connect. In line with other VQA work [103], the question is capped at 14 words, with each word converted into a GloVe embedding [121]. Questions with fewer than 14 words are padded with zero vectors. A question embedding is then generated by applying a GRU to the word embeddings. An LSTM-based attention mechanism considers both the question vector and the visual representations making up the nodes of the scene graph; this module considers previously attended areas when exploring new visual features. Finally, an LSTM-based language generator is used to generate the final answer. Another work to forgo a textual scene graph is that of Zhang et al. [55], who make use of word vectors to embed information about the image into a semantic graph. Using a GNN, they create enriched feature vectors representing the nodes, edges, and an image feature vector representing the global state. They include the question in the image feature by averaging the word vectors, which enables the GNN to reason about the image. Whilst both [120] and [55] yield good results, by only using word- or sentence-level embeddings and not a textual scene graph, they fail to model relationships in the textual domain, removing the models' ability to reason in that domain alone.
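
The question encoding described above for [120] can be sketched as follows; the 300-d GloVe dimension, the dictionary-style `glove` lookup, and the hidden size are assumptions for illustration, and the GRU is instantiated inline purely for brevity.

```python
import torch
import torch.nn as nn

def embed_question(tokens, glove, max_len=14, hidden=512):
    """Sketch of the question encoding used in works such as [120]: the question
    is capped at 14 words, each converted to a GloVe embedding (zero vectors pad
    shorter questions), and a GRU produces a single question embedding.
    `glove` is assumed to map word -> 300-d tensor."""
    dim = 300
    vecs = [glove.get(w, torch.zeros(dim)) for w in tokens[:max_len]]
    vecs += [torch.zeros(dim)] * (max_len - len(vecs))   # zero-pad to 14 words
    seq = torch.stack(vecs).unsqueeze(0)                 # (1, 14, 300)
    gru = nn.GRU(input_size=dim, hidden_size=hidden, batch_first=True)
    _, h_n = gru(seq)
    return h_n.squeeze(0).squeeze(0)                     # (hidden,) question embedding
```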

Both Li et al. [122] and Nuthalapati et al. [123] take a different route to the established multi-modal approach and instead use different forms of visual information. Li et al. [122] take inspiration from [39] and make use of both semantic and spatial graphs to represent the image. In addition to these explicit graphs, they also introduce an implicit graph, i.e. a fully connected graph between the detected objects with edge weights set by a GAT. The relation-aware visual features are then combined with the question vector using multi-modal fusion. The fused output is then used to predict an answer via an MLP.

Nuthalapati et al. [123] use a dual scene graph approach, using both visual and semantic graphs. These graphs are merged into a single graph embedding using a novel GAT architecture [102] that is able to attend to edges as well as nodes. The graphs are enriched with negative entities that appear in the question but not the graph. Pruning then takes place to remove nodes and edges that are more than K hops away from features mentioned in the question. A decoder is then used to produce an answer to the input question.

Table 4 A table showing the model details and VQA [49] Test-Dev results of selected VQA models

5.2 Knowledge-/fact-based VQA

Knowledge- or fact-based VQA is the challenging task of making use of external knowledge given in knowledge graphs such as WikiData [70] to answer questions about an image. The major challenge of this task is to create a model that can make use of all three mediums (image, question, and fact) to generate an appropriate answer. The MUCKO [124] architecture, shown in Fig. 4 (reused with permission), is given as a representative example of models that approach FVQA.

Fig. 4 MUCKO architecture [124] (reused with permission). Best viewed in colour

In [125], the authors present a novel GCN-based architecture for FVQA. Alongside the question and answer sets, a knowledge base of facts is also included, \(KB = \{f_1, f_2,..., f_{|KB|}\}\). Each fact \(f = (x, r, y)\) is formed of a visual concept grounded in the image (x), an attribute or phrase (y), and a relation r linking the two. Relations are drawn from a predefined set of 13 different ways a concept and an attribute can be related. Their work first reduces the search space to the 100 facts most likely to contain the correct answer by using GloVe embeddings [121] of words in the question and facts, before further reducing it to the most relevant facts \(f_{rel}\). These most relevant facts are turned into a graph where all the visual concepts and attributes from \(f_{rel}\) form the nodes. An edge joins two nodes if they are related by a fact in \(f_{rel}\). A GCN is then used to ‘reason’ over the graph to predict the final answer. Using a message passing architecture, the authors update the feature representations of the nodes, which are then fed into an MLP that predicts a binary label corresponding to whether or not the entity contains the answer.

Zhu et al. [124] use a multi-modal graph approach to representing images, with a visual, a semantic, and a knowledge graph. After graph construction, GCNs are applied to each modality to create richer feature embeddings. These embeddings are then processed in a cross-modal manner. Visual–fact aggregation and semantic–fact aggregation operations produce complementary information, which is then used with a fact–fact convolutional layer. This final layer takes into account all three modalities and produces an answer that considers the global context. The authors continue their work in [77] by replacing the cross-modal mechanism with a novel GRUC (Graph-based Read, Update, and Control) mechanism. The GRUC operates as two parallel pipelines. One pipeline starts with a concept from the knowledge graph and recurrently incorporates knowledge from the visual graph. The other starts with the same knowledge graph concept but incorporates semantic knowledge. At the end of the recurrent operations, the outputs of the two pipelines are fused together with the question and the original fact node. This fused feature is then used to predict the final answer. The change made to the cross-modal attention mechanism yields significant improvements on the FVQA benchmark when compared with MUCKO [124].

Liu et al. [126] also adopt a multi-modal approach, but use only the semantic and knowledge modalities. They propose a dual process system for FVQA that is based on dual process theory from cognitive science [127]. Their approach utilises a BERT encoder to represent the input question and a Faster-RCNN [78]-based feature extractor to represent the image features. The first of the two systems, based on the Transformer architecture [40], joins these two representations into a single multi-modal representation. The second system then develops a semantic graph by turning dense region captions into textual scene graphs (using SPICE), as well as a knowledge graph generated using the question input. A message passing GNN is then used to identify the important nodes and aggregate information between them using an attention weighting. A joint representation for each knowledge graph node is then learned by combining the node with the whole semantic graph according to an attention weighting. This joint representation is then used to predict the final answer.

The GNN-VQA proposed in [128] makes use of a bidirectional GNN that fuses structured and unstructured multi-modal data through a ‘supernode’. After extracting a semantic graph, they use a pretrained sentence BERT model to calculate the similarity between the potential answers and the Visual Genome region descriptions. The top-10 region descriptions in terms of similarity are averaged and used to define a concept node which is connected to each visual node of the semantic graph. Concepts are then extracted from the ConceptNet KG [67] using the labels of detected objects in the image and concepts extracted from potential answers. A GAT-based GNN is then used to construct better node representations which are then used to select the correct answer.

Moving away from the multi-modal approach, SGEITL [129] builds a semantic graph of the image and then follows Yang et al. [54] in introducing skip edges to the graph, essentially making it a complete graph. This graph then goes through a multi-hop graph Transformer, which masks the attention between nodes based on their distance, ensuring that only nearby nodes are attended to. Through their work, the authors demonstrate that structural information is useful when approaching the complex VQA task.

With their TRiG model, Gao et al. move knowledge-based VQA away from graph-based reasoning and into the text space, converting the visual content into textual descriptions that a pretrained language model can reason over to generate an answer.

In the image retrieval setting, vector quantisation has been used to build graphs more efficiently: a quantiser \(q(\textbf{x})\) maps a feature vector to a codeword \(c_i\). These codewords make up a set of length K known as a code book; thus, \(q(x) \in \mathcal {C}=\{c_i; i \in \{0... (K-1)\}\}\). The learned code words are combined with image features to form a landmark graph, based on the similarity graph except that the graph also has nodes learned through the quantisation process. Once the landmark graph has been constructed, a GCN is used to propagate features with the objective of moving similar images closer together in the feature space. The use of vector quantisation allows the landmark graph to exist in a lower-dimensional space, reducing computation when determining which images from the graph to return as candidates.
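
A minimal sketch of the quantisation step \(q(\textbf{x})\) is given below; codebook learning (e.g. via k-means) is omitted and the sizes are illustrative.

```python
import torch

def quantise(x, codebook):
    """Minimal sketch of vector quantisation: q(x) maps each feature vector to
    its nearest codeword c_i from a codebook of K learned codewords."""
    dists = torch.cdist(x, codebook)   # (n, K) distances to each codeword
    idx = dists.argmin(dim=1)          # index i of the nearest codeword
    return codebook[idx], idx          # quantised vectors and their codes

# n = 100 image features of dimension 256, K = 16 codewords (illustrative sizes)
codebook = torch.randn(16, 256)
codes, idx = quantise(torch.randn(100, 256), codebook)
```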

The authors of [130] showed that reasoning in the text space improved performance over graph-based reasoning in VQA.

In summary, Vision-Language tasks such as those discussed in this paper are set to have a fruitful future, with many opportunities for various graph structures to be exploited.