
1 Introduction

The past few years have witnessed unprecedented advances in automatic image caption generation. This progress can be attributed (i) to the invention of novel deep learning frameworks that learn to generate natural language descriptions of images in an end-to-end fashion, and (ii) to the availability of large annotated corpora of images paired with captions, such as MSCOCO [30], to train these models. The dominant methods are based on an encoder-decoder framework, which uses a deep convolutional neural network (CNN) to encode the image into a feature vector and then uses a recurrent neural network (RNN) to generate the caption from the encoded vector [27, 29, 44]. More recently, approaches based on attention mechanisms and reinforcement learning have dominated the MSCOCO captioning leaderboard [1, 18, 39].

Despite the impressive results achieved by deep learning methods, one performance bottleneck is the availability of large paired datasets: neural image captioning models are annotation-hungry, requiring large numbers of annotated image-caption pairs to achieve good results [19]. However, in many applications and languages, such large-scale annotations are not readily available, and they are expensive and slow to acquire. In these scenarios, unsupervised methods that can generate captions from unpaired data, or semi-supervised methods that can exploit paired annotations from other domains or languages, are highly desirable [5]. In this paper, we pursue the latter research avenue: we assume access to image-caption paired instances in one language (Chinese), and our goal is to transfer this knowledge to a target language (English) for which no such image-caption pairs are available. We also assume access to a separate source-target (Chinese-English) parallel corpus to help with the transfer. In other words, we use the source language (Chinese) as a pivot language to bridge the gap between an input image and a caption in the target language (English).

The concept of using a pivot language as an intermediary language has been studied previously in machine translation (MT) to translate between a resource-rich language and a resource-scarce language [6, 25, 42, 46]. The translation task in this strategy is performed in two steps. A source-to-pivot MT system first translates a source sentence into the pivot language, which is in turn translated to the target language by a pivot-to-target MT system. Although related, image captioning with the help of a pivot language is fundamentally different from MT, since it puts together two different tasks: captioning and translation. In addition, the pivot-based pipelined approach to MT suffers from two major problems when applied to image captioning. First, conventional pivot-based MT methods assume that the datasets for source-to-pivot and pivot-to-target translation come from the same (or similar) domain(s), with similar styles and word distributions. In image captioning, however, captions in the pivot language (Chinese) and sentences in the (Chinese-English) parallel corpus differ considerably in style and word distribution. For instance, the MSCOCO captioning dataset mostly consists of images of large scenes with object instances (nouns), whereas language parallel corpora are more generic. Second, errors made in source-to-pivot translation propagate to the pivot-to-target translation module in the pipelined approach.

In this paper, we present an approach that effectively captures the characteristics of an image captioner in the source language and aligns it to the target language using a separate source-target parallel corpus. More specifically, our pivot-based image captioning framework comprises an image captioner (image-to-pivot), an encoder-decoder model that learns to describe images in the pivot language, and a pivot-to-target translation model, another encoder-decoder model that translates pivot-language sentences into the target language; the two models are trained on two separate datasets. We tackle the differences in writing style and word distribution between the two datasets by adapting the language translation model to the captioning task. This is achieved by adapting both the encoder and the decoder of the pivot-to-target translation model. In particular, we regularize the word embeddings of the encoder (of the pivot language) and the decoder (of the target language) to make them similar to those of image captions. We also introduce a joint training algorithm to connect the two models and enable them to interact with each other during training. We use AIC-ICC [

$$\begin{aligned} y \sim \arg \max _{y}\big \{P(y|i; \mathbf {\theta }_{i \rightarrow y})\big \} \end{aligned}$$
(1)

where \(\mathbf {\theta }_{i \rightarrow y}\) are the model parameters to be learned in the absence of any paired data, \(i^{(n_i)} \nleftrightarrow y^{(n_y)}\). We use the pivot language x to learn the mapping \(i \xrightarrow []{\theta _{i\rightarrow x}} x \xrightarrow []{\theta _{x \rightarrow y}} y\). Note that the image-to-pivot (\({D_{i,x}}\)) and pivot-to-target (\({D_{x,y}}\)) datasets in our setting are two distinct datasets with possibly no common elements.

Fig. 1.

Pictorial depiction of our pivot-based unpaired image captioning setting. Here, i, x, y, and \(\hat{y}\) denote source image, pivot language sentence, target language sentence, and ground truth captions in target language, respectively. We use a dashed line to denote that there is no parallel corpus available for the pair. Solid lines with arrows represent decoding directions. Dashed lines inside a language (circle) denote stylistic and distributional differences between caption and translation data.

Figure 1 illustrates our pivot-based image captioning approach. We have an image captioning model \(P(x|i; \theta _{i\rightarrow x})\) to generate a caption in the pivot language from an image, and an NMT model \(P(y|x; \theta _{x\rightarrow y})\) to translate this caption into the target language. In addition, we have an autoencoder in the target language \(P(\hat{y}|\hat{y}; \theta _{\hat{y}\rightarrow \hat{y}})\) that guides the target language decoder to produce caption-like sentences. We train these components jointly so that they interact with each other. During inference, given an unseen image i to be described, we use the joint decoder:

$$\begin{aligned} y \sim \arg \max _{y}\big \{P(y|i; \mathbf {\theta }_{i \rightarrow x}, \mathbf {\theta }_{x \rightarrow y}) \big \} \end{aligned}$$
(2)

In the following, we first give an overview of neural methods for image captioning and machine translation using paired (parallel) data. Then, we present our approach that extends these standard models for unpaired image captioning with a pivot language.

3.1 Encoder-Decoder Models for Image Captioning and Machine Translation

Standard Image Captioning. For image captioning in the paired setting, the goal is to generate a caption \(\tilde{x}\) from an image i such that \(\tilde{x}\) is as similar as possible to the ground truth caption x. We use \(P_x(x|i; \mathbf {\theta }_{i \rightarrow x})\) to denote a standard encoder-decoder based image captioning model with parameters \(\mathbf {\theta }_{i \rightarrow x}\). We first encode the given image into image features v with a CNN-based image encoder: \(v=\text {CNN}(i)\). Then, we predict the image description x from the global image feature v. The training objective is to maximize the probability of the ground truth caption words given the image:

$$\begin{aligned} \tilde{\mathbf {\theta }}_{i \rightarrow x}&= \arg \max _{\mathbf {\theta }_{i \rightarrow x}}\big \{ \mathcal {L}_{i \rightarrow x} \big \} \end{aligned}$$
(3)
$$\begin{aligned}&=\arg \max _{\mathbf {\theta }_{i \rightarrow x}}\big \{ \sum _{n_i=0}^{N_i-1} \sum _{t=0}^{M^{(n_i)}-1} \log P_x(x_t^{(n_i)}|x^{(n_i)}_{0:t-1},i^{(n_i)}; \mathbf {\theta }_{i \rightarrow x}) \big \} \end{aligned}$$
(4)

where \(N_i\) is the number of image-caption pairs, \({M^{(n_i)}}\) is the length of the caption \(x^{(n_i)}\), \(x_t\) denotes a word in the caption, and \(P_x(x_t^{(n_i)}|x^{(n_i)}_{0:t-1},i^{(n_i)})\) corresponds to the activation of the Softmax layer. The decoded word is drawn from:

$$\begin{aligned} x_t\sim \arg \max _{\mathcal {V}_{i \rightarrow x}^{x}} P(x_t|x_{0:t-1};i) \end{aligned}$$
(5)

where \(\mathcal {V}_{i \rightarrow x}^{x}\) is the vocabulary of words in the image-caption dataset \(D_{i,x}\).
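To make the notation concrete, below is a minimal sketch of such an encoder-decoder captioner and the cross-entropy form of Eq. (4), written in PyTorch. The choice of ResNet-50 as the CNN, the LSTM decoder, the 512-dimensional sizes, and all names are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ImageCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=None)                 # CNN image encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.fc_v = nn.Linear(2048, hidden_dim)             # project v = CNN(i)
        self.embed = nn.Embedding(vocab_size, embed_dim)    # pivot-word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)        # Softmax layer (logits)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, M) word ids, starting with <bos>
        v = self.encoder(images).flatten(1)                 # global feature v, (B, 2048)
        h0 = torch.tanh(self.fc_v(v)).unsqueeze(0)          # init LSTM state from image
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions[:, :-1])                  # teacher-forcing inputs x_{0:t-1}
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                             # (B, M-1, vocab_size) logits


def caption_xe_loss(model, images, captions, pad_id=0):
    # Cross-entropy form of Eq. (4): predict x_t from x_{0:t-1} and the image.
    logits = model(images, captions)
    targets = captions[:, 1:]                               # gold next words
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_id)
```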

Neural Machine Translation. Given a pair of source and target sentences (xy), the NMT model \(P_y(y|x; \mathbf {\theta }_{x \rightarrow y})\) computes the conditional probability:

$$\begin{aligned} P_y(y|x)=\prod _{t=0}^{N-1}P(y_t|y_{0:t-1};x_{0:M-1}) \end{aligned}$$
(6)

where M and N are the lengths of the source and target sentences, respectively. The maximum-likelihood training objective of the model can be expressed as:

$$\begin{aligned} \tilde{\mathbf {\theta }}_{x \rightarrow y}&= \arg \max _{\mathbf {\theta }_{x \rightarrow y}}\big \{ \mathcal {L}_{x \rightarrow y} \big \} \end{aligned}$$
(7)
$$\begin{aligned}&=\arg \max _{\mathbf {\theta }_{x \rightarrow y}}\Big \{ \sum _{n_x=0}^{N_x-1} \sum _{t=0}^{N^{(n_x)}-1} \log P_y(y_t^{(n_x)}|y^{(n_x)}_{0:t-1}; x^{(n_x)}; \mathbf {\theta }_{x \rightarrow y})\big \} \end{aligned}$$
(8)

During inference, we compute the probability of the next symbol given the source sentence encoding and the target sequence decoded so far, and pick the word with the maximum probability from the vocabulary:

$$\begin{aligned} y_t\sim \arg \max _{\mathcal {V}_{x \rightarrow y}^{y}} P(y_t|y_{0:t-1};x_{0:M-1}) \end{aligned}$$
(9)

where \(\mathcal {V}_{x \rightarrow y}^{y}\) is the vocabulary of the target language in the translation dataset \(D_{x,y}\).
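The sketch below illustrates the teacher-forced objective of Eq. (8) and the greedy decoding rule of Eq. (9) under the same illustrative assumptions; a GRU-based `Seq2Seq` module and the helper `translation_xe_loss` are hypothetical names, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)   # encoder (pivot) embeddings
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)   # decoder (target) embeddings
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Teacher-forced decoding: condition on x_{0:M-1} and y_{0:t-1}.
        _, h = self.encoder(self.src_embed(src_ids))
        dec, _ = self.decoder(self.tgt_embed(tgt_ids[:, :-1]), h)
        return self.out(dec)                                  # (B, N-1, tgt_vocab)

    @torch.no_grad()
    def greedy_decode(self, src_ids, bos_id, eos_id, max_len=30):
        # Eq. (9): pick the most probable next word given x and y_{0:t-1}.
        _, h = self.encoder(self.src_embed(src_ids))          # src_ids: (1, M)
        y_t, output = torch.tensor([[bos_id]]), []
        for _ in range(max_len):
            dec, h = self.decoder(self.tgt_embed(y_t), h)
            y_t = self.out(dec[:, -1]).argmax(-1, keepdim=True)
            if y_t.item() == eos_id:
                break
            output.append(y_t.item())
        return output


def translation_xe_loss(model, src_ids, tgt_ids, pad_id=0):
    # Cross-entropy form of the maximum-likelihood objective in Eq. (8).
    logits = model(src_ids, tgt_ids)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id)
```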

Unpaired Image Captioning by Language Pivoting. In the unpaired setting, our goal is to generate a description y in the target language for an image i without any paired information. We assume there is a second language x, called the “pivot”, for which we have (separate) image-pivot and pivot-target paired datasets. The image-to-target model in the pivot-based setting can be decomposed into two sub-models by treating the pivot sentence as a latent variable:

$$\begin{aligned}&P(y|i; \mathbf {\theta }_{i \rightarrow x}, \mathbf {\theta }_{x \rightarrow y}) = \sum _{x} P_x(x|i; \mathbf {\theta }_{i \rightarrow x}) P_y(y|x; \mathbf {\theta }_{x \rightarrow y}) \end{aligned}$$
(10)

where \(P_x(x|i; \mathbf {\theta }_{i \rightarrow x})\) and \(P_y(y|x; \mathbf {\theta }_{x \rightarrow y})\) are the image captioning and NMT models, respectively. Due to the exponential search space in the pivot language, we approximate the captioning process with two steps. The first step translates the image i into a pivot language sentence \(\tilde{x}\). Then, the pivot language sentence is translated to a target language sentence \(\tilde{y}\). To learn such a pivot-based model, a simple approach is to combine the two loss functions in Eqs. (3) and (7) as follows:

$$\begin{aligned} \mathcal {J}_{i \rightarrow x, x \rightarrow y} = \mathcal {L}_{i \rightarrow x} + \mathcal {L}_{x \rightarrow y} \end{aligned}$$
(11)

During inference, the decoding decision is given by:

$$\begin{aligned} \tilde{x} =&\arg \max _{x}\big \{ P_x(x|i; \tilde{\mathbf {\theta }}_{i \rightarrow x}) \big \} \end{aligned}$$
(12)
$$\begin{aligned} \tilde{y} =&\arg \max _{y}\big \{ P_y(y|\tilde{x}; \tilde{\mathbf {\theta }}_{x \rightarrow y}) \big \} \end{aligned}$$
(13)

where \(\tilde{x}\) is the image description generated from i in the pivot language, \(\tilde{y}\) is the translation of \(\tilde{x}\), and \(\tilde{\theta }_{i\rightarrow x}\) and \(\tilde{\theta }_{x\rightarrow y}\) are the learned model parameters.
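A sketch of this two-step decoding (Eqs. (12)-(13)) is shown below. The `captioner` (with an assumed `greedy_caption` method analogous to `greedy_decode` above), the `translator`, and the vocabulary objects are hypothetical interfaces; the explicit re-indexing from the caption-domain vocabulary into the translation-domain vocabulary is where the domain mismatch discussed next becomes visible.

```python
import torch


@torch.no_grad()
def pivot_pipeline_caption(image, captioner, translator,
                           cap_vocab, mt_src_vocab, mt_tgt_vocab):
    # Step 1 (Eq. 12): x~ = argmax_x P_x(x | i; theta_{i->x}),
    # a caption in the pivot language (Chinese).
    pivot_ids = captioner.greedy_caption(image)          # caption-domain word ids
    pivot_words = [cap_vocab.id_to_word(t) for t in pivot_ids]

    # Re-index the pivot caption into the translation dataset's source
    # vocabulary; words unseen in D_{x,y} fall back to <unk>.
    src_ids = torch.tensor([[mt_src_vocab.word_to_id(w) for w in pivot_words]])

    # Step 2 (Eq. 13): y~ = argmax_y P_y(y | x~; theta_{x->y}),
    # the caption translated into the target language (English).
    tgt_ids = translator.greedy_decode(src_ids,
                                       bos_id=mt_tgt_vocab.bos_id,
                                       eos_id=mt_tgt_vocab.eos_id)
    return " ".join(mt_tgt_vocab.id_to_word(t) for t in tgt_ids)
```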

However, this pipelined approach to caption generation in the target language suffers from a couple of key limitations. First, image captioning and machine translation are two different tasks. The image-to-pivot and pivot-to-target models are quite different in terms of vocabulary and parameter space because they are trained on two possibly unrelated datasets. Image captions contain descriptions of objects in a given scene, whereas machine translation data is more generic, in our case containing news event descriptions, movie subtitles, and conversational texts. The two domains thus differ in writing style and word distribution. As a result, the captions generated by the pipelined approach may not resemble human-authored captions. Figure 1 distinguishes the two domains of pivot and target sentences: the caption domain and the translation domain (see the second and third circles). The second limitation is that errors made by the image-to-pivot captioning model propagate to the pivot-to-target translation model.

To overcome the limitations of the pivot-based caption generation, we propose to reduce the discrepancy between the image-to-pivot and pivot-to-target models, and to train them jointly so that they learn better models by interacting with each other during training. Figure 2 illustrates our approach. The two models share some common aspects that we can exploit to connect them as we describe below.

Fig. 2.

Illustration of our image captioning model with pivot language. The image captioning model first transforms an image into latent pivot sentences, from which our machine translation model generates the target caption.

Connecting Image-to-Pivot and Pivot-to-Target. One way to connect the two models is to share the corresponding embedding matrices by defining a common embedding matrix for the decoder of image-to-pivot and the encoder of pivot-to-target. However, since the caption and translation domains are different, their word embeddings should also be different. Therefore, rather than sharing a common embedding matrix, we add a regularizer \(\mathcal {R}_{i \rightarrow y}\) that brings the input embeddings of the NMT model close to the output embeddings of the image captioning model by minimizing their \(l_2\) distance. Formally,

$$\begin{aligned} \mathcal {R}_{i \rightarrow y}(\theta _{i\rightarrow x}^{w_x}, \theta _{x\rightarrow y}^{w_x}) = - \sum _{w_x \in \mathcal {V}^{x}_{i \rightarrow x} \cap \mathcal {V}^{x}_{x \rightarrow y}} || \mathbf {\theta }^{w_x}_{i \rightarrow x} - \mathbf {\theta }^{w_x}_{x \rightarrow y} ||_2 \end{aligned}$$
(14)

where \(w_x\) is a word in the pivot language that is shared by the two embedding matrices, \(\mathbf {\theta }_{i \rightarrow x}^{w_x} \in \mathbb {R}^{d}\) denotes the vector representation of \(w_x\) in the image-to-pivot model, and \(\mathbf {\theta }_{x \rightarrow y}^{w_x} \in \mathbb {R}^{d}\) denotes its representation in the pivot-to-target model. Note that here we adapt \(\mathbf {\theta }_{x \rightarrow y}^{w_x}\) towards \(\mathbf {\theta }_{i \rightarrow x}^{w_x}\); that is, \(\mathbf {\theta }_{i \rightarrow x}^{w_x}\) comes from an already learned model and is kept fixed during adaptation.
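Below is a minimal sketch of this connection term. It assumes the shared words have been index-aligned between the two vocabularies beforehand and that the captioner's word-embedding matrix plays the role of its output embeddings; both are assumptions about implementation details. Since Eq. (14) enters a maximized objective with a negative sign, the code returns the positive \(l_2\) penalty to be added to a minimized loss.

```python
import torch


def embedding_l2_regularizer(fixed_embed, adapted_embed, shared_pairs):
    """l2 connection term of Eq. (14).

    fixed_embed:   (V_a, d) embedding matrix of the already-trained model
                   (here the captioner's pivot-word embeddings), kept fixed.
    adapted_embed: (V_b, d) embedding matrix being adapted
                   (here the NMT encoder's input embeddings).
    shared_pairs:  list of (fixed_id, adapted_id) index pairs for the words
                   w_x that occur in both vocabularies.
    """
    fixed_ids = torch.tensor([a for a, _ in shared_pairs])
    adapted_ids = torch.tensor([b for _, b in shared_pairs])
    diff = fixed_embed[fixed_ids].detach() - adapted_embed[adapted_ids]
    # Minimizing loss + lambda * penalty is equivalent to maximizing the
    # objective containing the negative distance term of Eq. (14).
    return diff.norm(p=2, dim=1).sum()
```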

Adapting the encoder embeddings of the NMT model does not guarantee that the decoder of the model will produce caption-like sentences. For this, we also need to adapt the decoder embeddings of the NMT model to the caption data. We first use the target-target corpus \(D_{\hat{y},\hat{y}}=\{(\hat{y}^{(n_{\hat{y}})},\hat{y}^{(n_{\hat{y}})})\}_{n_{\hat{y}}=0}^{N_{\hat{y}}-1}\), in which each target-language caption is paired with itself, to train an autoencoder \(P(\hat{y}|\hat{y};\theta _{\hat{y}\rightarrow \hat{y}})\), where \(\theta _{\hat{y}\rightarrow \hat{y}}\) are the parameters of the autoencoder. The maximum-likelihood training objective of the autoencoder can be expressed as:

$$\begin{aligned} \tilde{\mathbf {\theta }}_{\hat{y}\rightarrow \hat{y}} = \arg \max _{\mathbf {\theta }_{\hat{y}\rightarrow \hat{y}}}\big \{ \mathcal {L}_{\hat{y} \rightarrow \hat{y}} \big \} \end{aligned}$$
(15)

where \(\mathcal {L}_{\hat{y} \rightarrow \hat{y}}\) is the cross-entropy (XE) loss. The autoencoder then “teaches” the decoder of the translation model \(P(y|x;\theta _{x\rightarrow y})\) to learn similar word representations. This is again achieved by minimizing the \(l_2\) distance between two vectors:

$$\begin{aligned} \mathcal {R}_{x\rightarrow \hat{y}}(\theta _{x\rightarrow y}^{w_y}, \theta _{\hat{y}\rightarrow \hat{y}}^{w_y}) = - \sum _{w_y \in \mathcal {V}^{y}_{x \rightarrow y} \cap \mathcal {V}^{y}_{\hat{y} \rightarrow \hat{y}}} || \mathbf {\theta }^{w_y}_{x \rightarrow y} - \mathbf {\theta }^{w_y}_{\hat{y} \rightarrow \hat{y}}||_2 \end{aligned}$$
(16)

where \(\mathcal {V}^{y}_{\hat{y} \rightarrow \hat{y}}\) is the target-language vocabulary in \(D_{\hat{y},\hat{y}}\), and \(w_y\) is a word in the target language that is shared by the two embedding matrices. By optimizing Eq. (16), we encourage the generated captions to have a style similar to that of the target-language captions.
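The sketch below covers the autoencoder objective of Eq. (15) and the decoder-side connection term of Eq. (16), reusing the kind of `Seq2Seq` module sketched earlier as the autoencoder (the caption \(\hat{y}\) is fed as both source and target). Holding the autoencoder embeddings fixed during adaptation is an assumption made here by analogy with Eq. (14), not something the paper states.

```python
import torch
import torch.nn as nn


def autoencoder_xe_loss(autoencoder, captions, pad_id=0):
    # Eq. (15): reconstruct target-language captions y^ from themselves,
    # so the autoencoder's embeddings capture caption-style word usage.
    logits = autoencoder(captions, captions)              # teacher-forced forward pass
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1),
        ignore_index=pad_id)


def decoder_embedding_regularizer(nmt_dec_embed, ae_embed, shared_pairs):
    # Eq. (16): pull the NMT decoder embeddings of shared target words w_y
    # toward the autoencoder's caption-style embeddings (detached here, an
    # assumption mirroring the fixed theta_{i->x} side of Eq. (14)).
    mt_ids = torch.tensor([m for m, _ in shared_pairs])
    ae_ids = torch.tensor([a for _, a in shared_pairs])
    diff = nmt_dec_embed[mt_ids] - ae_embed[ae_ids].detach()
    return diff.norm(p=2, dim=1).sum()
```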

Joint Training. In training, our goal is to find a set of source-to-target model parameters that maximizes the training objective:

$$\begin{aligned} \mathcal {J}_{i \rightarrow x, x \rightarrow y, y \rightarrow \hat{y}}&= \mathcal {L}_{i \rightarrow x} + \mathcal {L}_{x \rightarrow y}+\mathcal {L}_{\hat{y} \rightarrow \hat{y}}+\lambda \mathcal {R}_{i \rightarrow x, x \rightarrow y, y \rightarrow \hat{y}} \end{aligned}$$
(17)
$$\begin{aligned} \mathcal {R}_{i \rightarrow x, x \rightarrow y, y \rightarrow \hat{y}}&= \mathcal {R}_{i \rightarrow y}(\theta _{i\rightarrow x}^{w_x}, \theta _{x\rightarrow y}^{w_x})+ \mathcal {R}_{x \rightarrow \hat{y}}(\theta _{x\rightarrow y}^{w_y}, \theta _{\hat{y} \rightarrow \hat{y}}^{w_y}) \end{aligned}$$
(18)

where \(\lambda \) is a hyper-parameter that balances the loss terms against the connection (regularization) terms. Since both the captioner \(P_x(x|i;\theta _{i\rightarrow x})\) and the translator \(P_y(y|x;\theta _{x\rightarrow y})\) have large vocabularies (see Table 1), it is hard to train the joint model from a random initialization. Thus, in practice, we pre-train the captioner, translator, and autoencoder first, and then jointly optimize them with Eq. (17).
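Putting the pieces together, one joint training step for Eq. (17) might look like the sketch below, assuming the loss and regularizer helpers from the earlier sketches are in scope. Which embedding matrices are tied, the batching scheme, and the value of \(\lambda \) are illustrative assumptions rather than the authors' settings; minimizing the XE losses plus the positive \(l_2\) penalties is equivalent to maximizing Eq. (17), where the log-likelihoods and the negative distances are maximized.

```python
def joint_training_step(captioner, translator, autoencoder, optimizer,
                        caption_batch, mt_batch, ae_batch,
                        shared_pivot_pairs, shared_target_pairs, lam=1.0):
    images, pivot_caps = caption_batch      # minibatch from D_{i,x}
    src, tgt = mt_batch                     # minibatch from D_{x,y}
    tgt_caps = ae_batch                     # minibatch of target-language captions y^

    loss = (caption_xe_loss(captioner, images, pivot_caps)        # L_{i->x}
            + translation_xe_loss(translator, src, tgt)           # L_{x->y}
            + autoencoder_xe_loss(autoencoder, tgt_caps)          # L_{y^->y^}
            + lam * (embedding_l2_regularizer(                    # R_{i->y}, Eq. (14)
                         captioner.embed.weight,
                         translator.src_embed.weight,
                         shared_pivot_pairs)
                     + decoder_embedding_regularizer(             # R_{x->y^}, Eq. (16)
                         translator.tgt_embed.weight,
                         autoencoder.tgt_embed.weight,
                         shared_target_pairs)))

    optimizer.zero_grad()
    loss.backward()                          # after pre-training, all three
    optimizer.step()                         # components are updated jointly
    return loss.item()
```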