
1 Introduction

The past few years have witnessed unprecedented advances in automatic image caption generation. This progress can be attributed (i) to the invention of novel deep learning frameworks that learn to generate natural language descriptions of images in an end-to-end fashion, and (ii) to the availability of large annotated corpora of images paired with captions, such as MSCOCO [30], to train these models. The dominant methods are based on an encoder-decoder framework, which uses a deep convolutional neural network (CNN) to encode the image into a feature vector and then uses a recurrent neural network (RNN) to generate the caption from the encoded vector [27, 29, 44]. More recently, approaches based on attention mechanisms and reinforcement learning have dominated the MSCOCO captioning leaderboard [1, 18, 39].

Despite the impressive results achieved by deep learning methods, one performance bottleneck is the availability of large paired datasets: neural image captioning models are annotation-hungry, requiring large numbers of annotated image-caption pairs to achieve good results [19]. However, in many applications and languages, such large-scale annotations are not readily available, and they are expensive and slow to acquire. In these scenarios, unsupervised methods that can generate captions from unpaired data, or semi-supervised methods that can exploit paired annotations from other domains or languages, are highly desirable [5]. In this paper, we pursue the latter research avenue: we assume access to image-caption paired instances in one language (Chinese), and our goal is to transfer this knowledge to a target language (English) for which no such image-caption pairs are available. We also assume access to a separate source-target (Chinese-English) parallel corpus to help with the transfer. In other words, we use the source language (Chinese) as a pivot language to bridge the gap between an input image and a caption in the target language (English).

The concept of using a pivot language as an intermediary language has been studied previously in machine translation (MT) to translate between a resource-rich language and a resource-scarce language [6, 25, 42, 46]. The translation task in this strategy is performed in two steps. A source-to-pivot MT system first translates a source sentence into the pivot language, which is in turn translated to the target language by a pivot-to-target MT system. Although related, image captioning with the help of a pivot language is fundamentally different from MT, since it puts together two different tasks: captioning and translation. In addition, the pivot-based pipelined approach to MT suffers from two major problems when applied to image captioning. First, conventional pivot-based MT methods assume that the datasets for source-to-pivot and pivot-to-target translation come from the same (or similar) domain(s), with similar styles and word distributions. In image captioning, however, captions in the pivot language (Chinese) and sentences in the (Chinese-English) parallel corpus differ considerably in style and word distribution. For instance, the MSCOCO captioning dataset mostly consists of images of large scenes with object instances (nouns), whereas language parallel corpora are more generic. Second, errors made in source-to-pivot translation propagate to the pivot-to-target translation module in the pipelined approach.

In this paper, we present an approach that effectively captures the characteristics of an image captioner in the source language and aligns it to the target language using a separate source-target parallel corpus. More specifically, our pivot-based image captioning framework comprises an image captioner (image-to-pivot), an encoder-decoder model that learns to describe images in the pivot language, and a pivot-to-target translation model, another encoder-decoder model that translates pivot-language sentences into the target language; the two models are trained on two separate datasets. We tackle the differences in writing style and word distribution between the two datasets by adapting the language translation model to the captioning task. This is achieved by adapting both the encoder and the decoder of the pivot-to-target translation model. In particular, we regularize the word embeddings of the encoder (of the pivot language) and the decoder (of the target language) to make them similar to those of image captions. We also introduce a joint training algorithm to connect the two models and enable them to interact with each other during training. We use AIC-ICC [

$$\begin{aligned} y \sim \arg \max _{y}\big \{P(y|i; \mathbf {\theta }_{i \rightarrow y})\big \} \end{aligned}$$
(1)

where \(\mathbf {\theta }_{i \rightarrow y}\) are the model parameters to be learned in the absence of any paired data, \(i^{(n_i)} \nleftrightarrow y^{(n_y)}\). We use the pivot language x to learn the mapping \(i \xrightarrow []{\theta _{i\rightarrow x}} x \xrightarrow []{\theta _{x \rightarrow y}} y\). Note that the image-to-pivot (\({D_{i,x}}\)) and pivot-to-target (\({D_{x,y}}\)) datasets in our setting are two distinct datasets with possibly no common elements.

Fig. 1.

Pictorial depiction of our pivot-based unpaired image captioning setting. Here, i, x, y, and \(\hat{y}\) denote source image, pivot language sentence, target language sentence, and ground truth captions in target language, respectively. We use a dashed line to denote that there is no parallel corpus available for the pair. Solid lines with arrows represent decoding directions. Dashed lines inside a language (circle) denote stylistic and distributional differences between caption and translation data.

Figure 1 illustrates our pivot-based image captioning approach. We have an image captioning model \(P(x|i; \theta _{i\rightarrow x})\) to generate a caption in the pivot language from an image, and an NMT model \(P(y|x; \theta _{x\rightarrow y})\) to translate this caption into the target language. In addition, we have an autoencoder in the target language \(P(\hat{y}|\hat{y}; \theta _{\hat{y}\rightarrow \hat{y}})\) that guides the target language decoder to produce caption-like sentences. We train these components jointly so that they interact with each other. During inference, given an unseen image i to be described, we use the joint decoder:

$$\begin{aligned} y \sim \arg \max _{y}\big \{P(y|i; \mathbf {\theta }_{i \rightarrow x}, \mathbf {\theta }_{x \rightarrow y}) \big \} \end{aligned}$$
(2)

In the following, we first give an overview of neural methods for image captioning and machine translation using paired (parallel) data. Then, we present our approach that extends these standard models for unpaired image captioning with a pivot language.

3.1 Encoder-Decoder Models for Image Captioning and Machine Translation

Standard Image Captioning. For image captioning in the paired setting, the goal is to generate a caption \(\tilde{x}\) from an image i such that \(\tilde{x}\) is as similar as possible to the ground truth caption x. We use \(P_x(x|i; \mathbf {\theta }_{i \rightarrow x})\) to denote a standard encoder-decoder based image captioning model with parameters \(\mathbf {\theta }_{i \rightarrow x}\). We first encode the given image into image features v with a CNN-based image encoder: \(v=\text {CNN}(i)\). Then, we predict the image description x from the global image feature v. The training objective is to maximize the probability of the ground truth caption words given the image:

$$\begin{aligned} \tilde{\mathbf {\theta }}_{i \rightarrow x}&= \arg \max _{\mathbf {\theta }_{i \rightarrow x}}\big \{ \mathcal {L}_{i \rightarrow x} \big \} \end{aligned}$$
(3)
$$\begin{aligned}&=\arg \max _{\mathbf {\theta }_{i \rightarrow x}}\big \{ \sum _{n_i=0}^{N_i-1} \sum _{t=0}^{M^{(n_i)}-1} \log P_x(x_t^{(n_i)}|x^{(n_i)}_{0:t-1},i^{(n_i)}; \mathbf {\theta }_{i \rightarrow x}) \big \} \end{aligned}$$
(4)

where \(N_i\) is the number of image-caption pairs, \({M^{(n_i)}}\) is the length of the caption \(x^{(n_i)}\), \(x_t\) denotes a word in the caption, and \(P_x(x_t^{(n_i)}|x^{(n_i)}_{0:t-1},i^{(n_i)})\) corresponds to the activation of the Softmax layer. The decoded word is drawn from:

$$\begin{aligned} x_t\sim \arg \max _{\mathcal {V}_{i \rightarrow x}^{x}} P(x_t|x_{0:t-1};i) \end{aligned}$$
(5)

where \(\mathcal {V}_{i \rightarrow x}^{x}\) is the vocabulary of words in the image-caption dataset \(D_{i,x}\).
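To make the notation concrete, below is a minimal sketch of such an encoder-decoder captioner and the cross-entropy form of Eq. (4), written in PyTorch. The choice of ResNet-50 as the CNN, the LSTM decoder, the 512-dimensional sizes, and all names are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ImageCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=None)                 # CNN image encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.fc_v = nn.Linear(2048, hidden_dim)             # project v = CNN(i)
        self.embed = nn.Embedding(vocab_size, embed_dim)    # pivot-word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)        # Softmax layer (logits)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, M) word ids, starting with <bos>
        v = self.encoder(images).flatten(1)                 # global feature v, (B, 2048)
        h0 = torch.tanh(self.fc_v(v)).unsqueeze(0)          # init LSTM state from image
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions[:, :-1])                  # teacher-forcing inputs x_{0:t-1}
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                             # (B, M-1, vocab_size) logits


def caption_xe_loss(model, images, captions, pad_id=0):
    # Cross-entropy form of Eq. (4): predict x_t from x_{0:t-1} and the image.
    logits = model(images, captions)
    targets = captions[:, 1:]                               # gold next words
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_id)
```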

Neural Machine Translation. Given a pair of source and target sentences (xy), the NMT model \(P_y(y|x; \mathbf {\theta }_{x \rightarrow y})\) computes the conditional probability:

$$\begin{aligned} P_y(y|x)=\prod _{t=0}^{N-1}P(y_t|y_{0:t-1};x_{0:M-1}) \end{aligned}$$
(6)

where M and N are the lengths of the source and target sentences, respectively. The maximum-likelihood training objective of the model can be expressed as:

$$\begin{aligned} \tilde{\mathbf {\theta }}_{x \rightarrow y}&= \arg \max _{\mathbf {\theta }_{x \rightarrow y}}\big \{ \mathcal {L}_{x \rightarrow y} \big \} \end{aligned}$$
(7)
$$\begin{aligned}&=\arg \max _{\mathbf {\theta }_{x \rightarrow y}}\Big \{ \sum _{n_x=0}^{N_x-1} \sum _{t=0}^{N^{(n_x)}-1} \log P_y(y_t^{(n_x)}|y^{(n_x)}_{0:t-1}; x^{(n_x)}; \mathbf {\theta }_{x \rightarrow y})\big \} \end{aligned}$$
(8)

During inference, we compute the probability of the next symbol given the source sentence encoding and the target sequence decoded so far, and pick the word with the maximum probability from the vocabulary:

$$\begin{aligned} y_t\sim \arg \max _{\mathcal {V}_{x \rightarrow y}^{y}} P(y_t|y_{0:t-1};x_{0:M-1}) \end{aligned}$$
(9)

where \(\mathcal {V}_{x \rightarrow y}^{y}\) is the vocabulary of the target language in the translation dataset \(D_{x,y}\).
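The sketch below illustrates the teacher-forced objective of Eq. (8) and the greedy decoding rule of Eq. (9) under the same illustrative assumptions; a GRU-based `Seq2Seq` module and the helper `translation_xe_loss` are hypothetical names, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)   # encoder (pivot) embeddings
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)   # decoder (target) embeddings
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Teacher-forced decoding: condition on x_{0:M-1} and y_{0:t-1}.
        _, h = self.encoder(self.src_embed(src_ids))
        dec, _ = self.decoder(self.tgt_embed(tgt_ids[:, :-1]), h)
        return self.out(dec)                                  # (B, N-1, tgt_vocab)

    @torch.no_grad()
    def greedy_decode(self, src_ids, bos_id, eos_id, max_len=30):
        # Eq. (9): pick the most probable next word given x and y_{0:t-1}.
        _, h = self.encoder(self.src_embed(src_ids))          # src_ids: (1, M)
        y_t, output = torch.tensor([[bos_id]]), []
        for _ in range(max_len):
            dec, h = self.decoder(self.tgt_embed(y_t), h)
            y_t = self.out(dec[:, -1]).argmax(-1, keepdim=True)
            if y_t.item() == eos_id:
                break
            output.append(y_t.item())
        return output


def translation_xe_loss(model, src_ids, tgt_ids, pad_id=0):
    # Cross-entropy form of the maximum-likelihood objective in Eq. (8).
    logits = model(src_ids, tgt_ids)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id)
```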

Unpaired Image Captioning by Language Pivoting. In the unpaired setting, our goal is to generate a description y in the target language for an image i without any paired information. We assume there is a second language x, called the “pivot”, for which we have (separate) image-pivot and pivot-target paired datasets. The image-to-target model in the pivot-based setting can be decomposed into two sub-models by treating the pivot sentence as a latent variable:

$$\begin{aligned}&P(y|i; \mathbf {\theta }_{i \rightarrow x}, \mathbf {\theta }_{x \rightarrow y}) = \sum _{x} P_x(x|i; \mathbf {\theta }_{i \rightarrow x}) P_y(y|x; \mathbf {\theta }_{x \rightarrow y}) \end{aligned}$$
(10)

where \(P_x(x|i; \mathbf {\theta }_{i \rightarrow x})\) and \(P_y(y|x; \mathbf {\theta }_{x \rightarrow y})\) are the image captioning and NMT models, respectively. Due to the exponential search space in the pivot language, we approximate the captioning process with two steps. The first step translates the image i into a pivot language sentence \(\tilde{x}\). Then, the pivot language sentence is translated to a target language sentence \(\tilde{y}\). To learn such a pivot-based model, a simple approach is to combine the two loss functions in Eqs. (3) and (7) as follows:

$$\begin{aligned} \mathcal {J}_{i \rightarrow x, x \rightarrow y} = \mathcal {L}_{i \rightarrow x} + \mathcal {L}_{x \rightarrow y} \end{aligned}$$
(11)

During inference, the decoding decision is given by:

$$\begin{aligned} \tilde{x} =&\arg \max _{x}\big \{ P_x(x|i; \tilde{\mathbf {\theta }}_{i \rightarrow x}) \big \} \end{aligned}$$
(12)
$$\begin{aligned} \tilde{y} =&\arg \max _{y}\big \{ P_y(y|\tilde{x}; \tilde{\mathbf {\theta }}_{x \rightarrow y}) \big \} \end{aligned}$$
(13)

where \(\tilde{x}\) is the image description generated from i in the pivot language, \(\tilde{y}\) is the translation of \(\tilde{x}\), and \(\tilde{\theta }_{i\rightarrow x}\) and \(\tilde{\theta }_{x\rightarrow y}\) are the learned model parameters.
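A sketch of this two-step decoding (Eqs. (12)-(13)) is shown below. The `captioner` (with an assumed `greedy_caption` method analogous to `greedy_decode` above), the `translator`, and the vocabulary objects are hypothetical interfaces; the explicit re-indexing from the caption-domain vocabulary into the translation-domain vocabulary is where the domain mismatch discussed next becomes visible.

```python
import torch


@torch.no_grad()
def pivot_pipeline_caption(image, captioner, translator,
                           cap_vocab, mt_src_vocab, mt_tgt_vocab):
    # Step 1 (Eq. 12): x~ = argmax_x P_x(x | i; theta_{i->x}),
    # a caption in the pivot language (Chinese).
    pivot_ids = captioner.greedy_caption(image)          # caption-domain word ids
    pivot_words = [cap_vocab.id_to_word(t) for t in pivot_ids]

    # Re-index the pivot caption into the translation dataset's source
    # vocabulary; words unseen in D_{x,y} fall back to <unk>.
    src_ids = torch.tensor([[mt_src_vocab.word_to_id(w) for w in pivot_words]])

    # Step 2 (Eq. 13): y~ = argmax_y P_y(y | x~; theta_{x->y}),
    # the caption translated into the target language (English).
    tgt_ids = translator.greedy_decode(src_ids,
                                       bos_id=mt_tgt_vocab.bos_id,
                                       eos_id=mt_tgt_vocab.eos_id)
    return " ".join(mt_tgt_vocab.id_to_word(t) for t in tgt_ids)
```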

However, this pipelined approach to caption generation in the target language suffers from a couple of key limitations. First, image captioning and machine translation are two different tasks. The image-to-pivot and pivot-to-target models are quite different in terms of vocabulary and parameter space because they are trained on two possibly unrelated datasets. Image captions contain descriptions of objects in a given scene, whereas machine translation data is more generic, in our case containing news event descriptions, movie subtitles, and conversational texts. The two domains thus differ in writing style and word distribution. As a result, the captions generated by the pipelined approach may not resemble human-authored captions. Figure 1 distinguishes the two domains of pivot and target sentences: the caption domain and the translation domain (see the second and third circles). The second limitation is that errors made by the image-to-pivot captioning model propagate to the pivot-to-target translation model.

To overcome the limitations of the pivot-based caption generation, we propose to reduce the discrepancy between the image-to-pivot and pivot-to-target models, and to train them jointly so that they learn better models by interacting with each other during training. Figure 2 illustrates our approach. The two models share some common aspects that we can exploit to connect them as we describe below.

Fig. 2.

Illustration of our image captioning model with pivot language. The image captioning model first transforms an image into latent pivot sentences, from which our machine translation model generates the target caption.

Connecting Image-to-Pivot and Pivot-to-Target. One way to connect the two models is to share the corresponding embedding matrices by defining a common embedding matrix for the decoder of image-to-pivot and the encoder of pivot-to-target. However, since the caption and translation domains are different, their word embeddings should also be different. Therefore, rather than sharing a common embedding matrix, we add a regularizer \(\mathcal {R}_{i \rightarrow y}\) that brings the input embeddings of the NMT model close to the output embeddings of the image captioning model by minimizing their \(l_2\) distance. Formally,

$$\begin{aligned} \mathcal {R}_{i \rightarrow y}(\theta _{i\rightarrow x}^{w_x}, \theta _{x\rightarrow y}^{w_x}) = - \sum _{w_x \in \mathcal {V}^{x}_{i \rightarrow x} \cap \mathcal {V}^{x}_{x \rightarrow y}} || \mathbf {\theta }^{w_x}_{i \rightarrow x} - \mathbf {\theta }^{w_x}_{x \rightarrow y} ||_2 \end{aligned}$$
(14)

where \(w_x\) is a word in the pivot language that is shared by the two embedding matrices, \(\mathbf {\theta }_{i \rightarrow x}^{w_x} \in \mathbb {R}^{d}\) denotes the vector representation of \(w_x\) in the image-to-pivot model, and \(\mathbf {\theta }_{x \rightarrow y}^{w_x} \in \mathbb {R}^{d}\) denotes its representation in the pivot-to-target model. Note that here we adapt \(\mathbf {\theta }_{x \rightarrow y}^{w_x}\) towards \(\mathbf {\theta }_{i \rightarrow x}^{w_x}\); that is, \(\mathbf {\theta }_{i \rightarrow x}^{w_x}\) comes from an already learned model and is kept fixed during adaptation.
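Below is a minimal sketch of this connection term. It assumes the shared words have been index-aligned between the two vocabularies beforehand and that the captioner's word-embedding matrix plays the role of its output embeddings; both are assumptions about implementation details. Since Eq. (14) enters a maximized objective with a negative sign, the code returns the positive \(l_2\) penalty to be added to a minimized loss.

```python
import torch


def embedding_l2_regularizer(fixed_embed, adapted_embed, shared_pairs):
    """l2 connection term of Eq. (14).

    fixed_embed:   (V_a, d) embedding matrix of the already-trained model
                   (here the captioner's pivot-word embeddings), kept fixed.
    adapted_embed: (V_b, d) embedding matrix being adapted
                   (here the NMT encoder's input embeddings).
    shared_pairs:  list of (fixed_id, adapted_id) index pairs for the words
                   w_x that occur in both vocabularies.
    """
    fixed_ids = torch.tensor([a for a, _ in shared_pairs])
    adapted_ids = torch.tensor([b for _, b in shared_pairs])
    diff = fixed_embed[fixed_ids].detach() - adapted_embed[adapted_ids]
    # Minimizing loss + lambda * penalty is equivalent to maximizing the
    # objective containing the negative distance term of Eq. (14).
    return diff.norm(p=2, dim=1).sum()
```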

Adapting the encoder embeddings of the NMT model does not guarantee that the decoder of the model will produce caption-like sentences. For this, we also need to adapt the decoder embeddings of the NMT model to the caption data. We first use the target-target corpus \(D_{\hat{y},\hat{y}}=\{(\hat{y}^{(n_{\hat{y}})},\hat{y}^{(n_{\hat{y}})})\}_{n_{\hat{y}}=0}^{N_{\hat{y}}-1}\), in which each target-language caption is paired with itself, to train an autoencoder \(P(\hat{y}|\hat{y};\theta _{\hat{y}\rightarrow \hat{y}})\), where \(\theta _{\hat{y}\rightarrow \hat{y}}\) are the parameters of the autoencoder. The maximum-likelihood training objective of the autoencoder can be expressed as:

$$\begin{aligned} \tilde{\mathbf {\theta }}_{\hat{y}\rightarrow \hat{y}} = \arg \max _{\mathbf {\theta }_{\hat{y}\rightarrow \hat{y}}}\big \{ \mathcal {L}_{\hat{y} \rightarrow \hat{y}} \big \} \end{aligned}$$
(15)

where \(\mathcal {L}_{\hat{y} \rightarrow \hat{y}}\) is the cross-entropy (XE) loss. The autoencoder then “teaches” the decoder of the translation model \(P(y|x;\theta _{x\rightarrow y})\) to learn similar word representations. This is again achieved by minimizing the \(l_2\) distance between two vectors:

$$\begin{aligned} \mathcal {R}_{x\rightarrow \hat{y}}(\theta _{x\rightarrow y}^{w_y}, \theta _{\hat{y}\rightarrow \hat{y}}^{w_y}) = - \sum _{w_y \in \mathcal {V}^{y}_{x \rightarrow y} \cap \mathcal {V}^{y}_{\hat{y} \rightarrow \hat{y}}} || \mathbf {\theta }^{w_y}_{x \rightarrow y} - \mathbf {\theta }^{w_y}_{\hat{y} \rightarrow \hat{y}}||_2 \end{aligned}$$
(16)

where \(\mathcal {V}^{y}_{\hat{y} \rightarrow \hat{y}}\) is the target-language vocabulary in \(D_{\hat{y},\hat{y}}\), and \(w_y\) is a word in the target language that is shared by the two embedding matrices. By optimizing Eq. (16), we encourage the generated captions to have a style similar to that of the target-language captions.
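The sketch below covers the autoencoder objective of Eq. (15) and the decoder-side connection term of Eq. (16), reusing the kind of `Seq2Seq` module sketched earlier as the autoencoder (the caption \(\hat{y}\) is fed as both source and target). Holding the autoencoder embeddings fixed during adaptation is an assumption made here by analogy with Eq. (14), not something the paper states.

```python
import torch
import torch.nn as nn


def autoencoder_xe_loss(autoencoder, captions, pad_id=0):
    # Eq. (15): reconstruct target-language captions y^ from themselves,
    # so the autoencoder's embeddings capture caption-style word usage.
    logits = autoencoder(captions, captions)              # teacher-forced forward pass
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1),
        ignore_index=pad_id)


def decoder_embedding_regularizer(nmt_dec_embed, ae_embed, shared_pairs):
    # Eq. (16): pull the NMT decoder embeddings of shared target words w_y
    # toward the autoencoder's caption-style embeddings (detached here, an
    # assumption mirroring the fixed theta_{i->x} side of Eq. (14)).
    mt_ids = torch.tensor([m for m, _ in shared_pairs])
    ae_ids = torch.tensor([a for _, a in shared_pairs])
    diff = nmt_dec_embed[mt_ids] - ae_embed[ae_ids].detach()
    return diff.norm(p=2, dim=1).sum()
```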

Joint Training. In training, our goal is to find a set of source-to-target model parameters that maximizes the training objective:

$$\begin{aligned} \mathcal {J}_{i \rightarrow x, x \rightarrow y, y \rightarrow \hat{y}}&= \mathcal {L}_{i \rightarrow x} + \mathcal {L}_{x \rightarrow y}+\mathcal {L}_{\hat{y} \rightarrow \hat{y}}+\lambda \mathcal {R}_{i \rightarrow x, x \rightarrow y, y \rightarrow \hat{y}} \end{aligned}$$
(17)
$$\begin{aligned} \mathcal {R}_{i \rightarrow x, x \rightarrow y, y \rightarrow \hat{y}}&= \mathcal {R}_{i \rightarrow y}(\theta _{i\rightarrow x}^{w_x}, \theta _{x\rightarrow y}^{w_x})+ \mathcal {R}_{x \rightarrow \hat{y}}(\theta _{x\rightarrow y}^{w_y}, \theta _{\hat{y} \rightarrow \hat{y}}^{w_y}) \end{aligned}$$
(18)

where \(\lambda \) is a hyper-parameter that balances the loss terms against the connection (regularization) terms. Since both the captioner \(P_x(x|i;\theta _{i\rightarrow x})\) and the translator \(P_y(y|x;\theta _{x\rightarrow y})\) have large vocabularies (see Table 1), it is hard to train the joint model from a random initialization. Thus, in practice, we pre-train the captioner, translator, and autoencoder first, and then jointly optimize them with Eq. (17).
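Putting the pieces together, one joint training step for Eq. (17) might look like the sketch below, assuming the loss and regularizer helpers from the earlier sketches are in scope. Which embedding matrices are tied, the batching scheme, and the value of \(\lambda \) are illustrative assumptions rather than the authors' settings; minimizing the XE losses plus the positive \(l_2\) penalties is equivalent to maximizing Eq. (17), where the log-likelihoods and the negative distances are maximized.

```python
def joint_training_step(captioner, translator, autoencoder, optimizer,
                        caption_batch, mt_batch, ae_batch,
                        shared_pivot_pairs, shared_target_pairs, lam=1.0):
    images, pivot_caps = caption_batch      # minibatch from D_{i,x}
    src, tgt = mt_batch                     # minibatch from D_{x,y}
    tgt_caps = ae_batch                     # minibatch of target-language captions y^

    loss = (caption_xe_loss(captioner, images, pivot_caps)        # L_{i->x}
            + translation_xe_loss(translator, src, tgt)           # L_{x->y}
            + autoencoder_xe_loss(autoencoder, tgt_caps)          # L_{y^->y^}
            + lam * (embedding_l2_regularizer(                    # R_{i->y}, Eq. (14)
                         captioner.embed.weight,
                         translator.src_embed.weight,
                         shared_pivot_pairs)
                     + decoder_embedding_regularizer(             # R_{x->y^}, Eq. (16)
                         translator.tgt_embed.weight,
                         autoencoder.tgt_embed.weight,
                         shared_target_pairs)))

    optimizer.zero_grad()
    loss.backward()                          # after pre-training, all three
    optimizer.step()                         # components are updated jointly
    return loss.item()
```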