
1 Introduction

Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class. Compared to conventional correspondence tasks handling pictures of the same scene, such as stereo matching [1, 2] and motion estimation [3,4,5], the problem of semantic correspondence involves substantially larger changes in appearance and spatial layout, thus remaining very challenging. For this reason, traditional approaches based on hand-crafted features such as SIFT [6, 7] and HOG [8,9,10] do not produce satisfactory results on this problem due to lack of high-level semantics in local feature representations.

Fig. 1.

The proposed attentive semantic alignment. Our model estimates dense correspondences of objects by predicting a set of global transformation parameters via an attention process. The attention process spatially focuses on the reliable local transformation features, filtering out irrelevant backgrounds and clutter.

While previous approaches to the problem focus on introducing an effective spatial regularizer in matching [7, 9, 11], recent convolutional neural networks have advanced this area by learning high-level semantic features [12,13,14,15,16,17,18,19,20,21,22,23,24]. One of the main approaches [13] is to estimate parameters of a global transformation model that densely aligns one image to the other. In contrast to other approaches, it casts the whole correspondence problem for all individual features into a simple regression problem with a global transformation model, thus predicting dense correspondences through an efficient pipeline. On the other hand, however, the global alignment approach may be easily distracted: an entire correlation map between all feature pairs across images is used to predict the global transformation, and thus noisy features from different backgrounds, clutter, and occlusion may distract the predictor from correctly estimating the alignment. This is a particularly challenging issue in the problem of semantic correspondence, where a large degree of image variation is often involved.

In this paper, we introduce an attentive semantic alignment method that focuses on reliable correlations, filtering out distractors as shown in Fig. 1. For effective attention, we also propose an offset-aware correlation kernel that learns to capture translation-invariant local transformations in computing correlation values over spatial locations. The resultant feature map of offset-aware correlation (OAC) kernels is computed from two input features, where each activation of the feature map represents how smoothly a source feature is transformed spatially to the target feature map. This use of OAC kernels greatly improves the subsequent attention process. Experiments demonstrate the effectiveness of the attentive model and offset-aware kernel, and the proposed model combining both techniques achieves state-of-the-art performance.

Our contribution in this work is threefold:

  • The proposed algorithm incorporates an attention process to estimate a global transformation from a set of inconsistent and noisy local transformations for semantic image alignment.

  • We introduce offset-aware correlation kernels to guide the network in capturing local transformations at each spatial location effectively, and employ the kernels to compute feature correlations between two images for better representation of semantic alignment.

  • The proposed network with the attention module and offset-aware correlation kernels achieves state-of-the-art performance on semantic correspondence benchmarks.

The rest of the paper is organized as follows. We overview the related work in Sect. 2. Section 3 describes our proposed network with the attention process and the offset-aware correlation kernels. Finally, we show the experimental results of our method and conclude the paper in Sects. 4 and 5.

2 Related Work

Most approaches to semantic correspondence are based on dense matching of local image features. Early methods extract local features of patches using hand-crafted feature descriptors [25] such as SIFT [7, 11, 26, 27] and HOG [9, 10, 28, 29]. In spite of some success, the lack of high-level semantics in the feature representation makes these approaches suffer from non-rigid deformation and large appearance changes of objects. While such challenges have been mainly investigated in the area of graph-based image matching [28, 30,31,32], recent methods [15,16,17,18,19,20,21,22,23,24] rely on deep neural networks to extract high-level features of patches for robust matching. More recently, Han et al. [14] propose a deep neural network that learns both a feature extractor and a matching model for semantic correspondence. In spite of these developments, all these approaches detect correspondences by matching patches or region proposals based on their local features. In contrast, Rocco et al. [13] propose a global transformation estimation method, which is the work most relevant to ours. Their model in [13] predicts the transformation parameters from a correlation map obtained by computing correlations of every pair of features in source and target feature maps. Although this model is similar to ours in that it estimates the global transformation based on correlations of feature pairs, our model is distinguished by the attention process suppressing irrelevant features and the use of the OAC kernels constructing local transformation features.

There are some related studies on other tasks using feature correlations such as optical flow estimation [3] and stereo matching [33, 34]. Dosovitskiy et al. [3] use correlations between features of two video frames to estimate optical flow, while Zbontar et al. [33] and Luo et al. [34] extract feature correlations from patches of images for stereo matching. Although all these methods utilize correlations, they extract them from features in a limited set of candidate regions. Moreover, unlike ours, they do not explore the attentive process and the offset-aware correlation kernels.

Lately, attention models have been widely explored for various tasks with multi-modal inputs such as image captioning [35, 36], visual question answering [37, 38], attribute prediction [39] and machine translation [40, 41]. In these studies, models attend to the relevant regions referred to and guided by another modality such as language, while the proposed model attends based on self-guidance. Noh et al. [42] use an attention process for image retrieval to extract deep local features, where the attention is obtained from the features themselves as in our work.

3 Deep Attentive Semantic Alignment Network

We propose a deep neural network architecture for semantic alignment incorporating an attention process with a novel offset-aware correlation kernel. Our network takes two images as inputs and estimates a set of global transformation parameters using three main components: feature extractor, local transformation encoder, and attentive global transformation estimator, as presented in Fig. 2. We describe each of these components in detail.

Fig. 2.

Overall architecture of the proposed network. It consists of three main components: feature extractor, local transformation encoder, and attentive global transformation estimator. For details, see text.

3.1 Feature Extractor

Given source and target images, we first extract their image feature maps \(\varvec{f}^\mathrm {src},\varvec{f}^\mathrm {trg} \in \mathbbm {R}^{D\times H\times W}\) using a fully convolutional image feature extractor, where D, H, and W denote the number of channels, height, and width of the feature maps, respectively. We use a VGG-16 [43] model pretrained on ImageNet [44] and extract features from its pool4 layer. We share the weights of the feature extractor for both source and target images. Input images are resized to \(240\times 240\) and fed to the feature extractor, resulting in \(15\times 15\) feature maps with 512 channels. After extracting the features, we normalize them using the \(L_2\) norm.
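
As a concrete illustration, the following PyTorch sketch (ours, not the authors' released code) extracts such pool4 features; the layer indexing, variable names, and the omission of ImageNet mean/std normalization are our own simplifying assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

# VGG-16 truncated at pool4 (index 23 of vgg16.features), shared by both images.
# With 240x240 inputs, pool4 produces a 512 x 15 x 15 feature map.
vgg = torchvision.models.vgg16(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(vgg.features.children())[:24]).eval()

def extract_features(image):                          # image: (B, 3, H, W)
    image = F.interpolate(image, size=(240, 240), mode='bilinear', align_corners=False)
    feat = feature_extractor(image)                   # (B, 512, 15, 15)
    return F.normalize(feat, p=2, dim=1)              # channel-wise L2 normalization

f_src = extract_features(torch.rand(1, 3, 480, 360))
f_trg = extract_features(torch.rand(1, 3, 500, 375))
```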

3.2 Local Transformation Encoder

Given source and target feature maps from the feature extractor, the model encodes local transformations of the source features with respect to the target feature map. The encoding is obtained with a novel offset-aware correlation (OAC) kernel, which helps overcome limitations of conventional correlation layers [13]. We briefly describe the correlation layer and its limitations, and then discuss the proposed OAC kernel.

Correlation Layer. The correlation layer computes correlations of all pairs of features from the source and the target images [13]. Specifically, the correlation layer takes two feature maps as its inputs and constructs a correlation map \(\varvec{c} \in \mathbbm {R}^{HW \times H \times W}\), which is given by

$$\begin{aligned} c_{i,j} = {f^\mathrm {src}_{i,j}}^\top \hat{\varvec{f}}^\mathrm {trg}, \end{aligned}$$
(1)

where \(c_{i,j} \in \varvec{c}\) is an HW-dimensional correlation vector at a spatial location (i, j), \(f_{i,j}^\mathrm {src} \in \varvec{f}^\mathrm {src}\) is a feature vector at location (i, j) of the source image, and \(\hat{\varvec{f}}^\mathrm {trg} \in \mathbbm {R}^{D\times HW}\) is a spatially flattened feature map of \(\varvec{f}^\mathrm {trg}\) of the target image. In other words, each correlation vector \(c_{i,j}\) consists of correlations between a single source feature \(f_{i,j}^\mathrm {src}\) and all target features of \(\varvec{f}^\mathrm {trg}\). Although each element of a correlation vector maintains the correspondence likelihood of a source feature onto a certain location in the target feature map, the order of elements in the correlation vector is based on the absolute coordinates of individual target features regardless of the source feature location. This means that decoding the local displacement of the source feature requires not only the vector itself but also the spatial location of the source feature. For example, consider a correlation vector \(c_{i,j}=[1, 0, 0, 0]^\top \) between \(2\times 2\) feature maps, each element of which is the correlation of \(f^\mathrm {src}_{i,j}\) with \(f^\mathrm {trg}_{0,0}\), \(f^\mathrm {trg}_{0,1}\), \(f^\mathrm {trg}_{1,0}\) and \(f^\mathrm {trg}_{1,1}\), respectively. The displacement represented by the vector varies with the coordinate of the source feature (i, j). When \((i,j)=(0,0)\), it indicates that the source feature \(f^\mathrm {src}_{0,0}\) remains at the same location (0, 0) in the target feature map. When \((i,j)=(0,1)\), it implies that \(f^\mathrm {src}_{0,1}\) is moved to the left of its original location in the target feature map.
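
The correlation map of Eq. (1) reduces to a single matrix multiplication; a minimal sketch, assuming the L2-normalized feature maps from the previous sketch:

```python
def correlation_map(f_src, f_trg):
    # f_src, f_trg: (B, D, H, W) L2-normalized feature maps
    B, D, H, W = f_src.shape
    src = f_src.view(B, D, H * W)                           # (B, D, HW)
    trg = f_trg.view(B, D, H * W)                           # (B, D, HW)
    corr = torch.bmm(src.transpose(1, 2), trg)              # (B, HW_src, HW_trg)
    # channel q of the output holds correlations with target location (q // W, q % W)
    return corr.view(B, H, W, H * W).permute(0, 3, 1, 2)    # (B, HW, H, W), Eq. (1)

c = correlation_map(f_src, f_trg)                           # (1, 225, 15, 15)
```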

Given a correlation map, decoding the local displacement of a source feature requires incorporating the offset information from the source feature to individual target features, and this local process is crucial for the subsequent spatial attention process described in the next section. Therefore, we introduce an offset-aware correlation kernel that utilizes the offsets between features during kernel application.

Fig. 3.

Offset-aware correlation kernel at different source locations: (a) at (0, 0) and (b) at (0, 1). Each dotted line connects source and target features to compute correlation, and \(w_{i,j}\) represents a kernel weight for the dotted line. Note that kernel weights are associated with different correlation pairs when source locations vary.

Offset-Aware Correlation Kernels. Similarly to the correlation layer, our OAC kernels also take two input feature maps and utilize correlations of all feature pairs between these feature maps. The kernels naturally capture the displacement of a source feature in the target feature map by aligning kernel weights based on the offset between the source and target features for each correlation as illustrated in Fig. 3. Formally speaking, an OAC kernel captures feature displacement of a source feature \(f^\mathrm {src}_{i,j}\) by

$$\begin{aligned} h_{i, j}^{(n)}&=\sum _{k=1}^{H}{\sum _{l=1}^{W}{w_{i-k, j-l}^{(n)} c_{i,j;k,l}}} \end{aligned}$$
(2)
$$\begin{aligned}&=\sum _{k=1}^{H}{\sum _{l=1}^{W}{w_{i-k, j-l}^{(n)} {f^\mathrm {src}_{i,j}}^\top f^\mathrm {trg}_{k,l}}}, \end{aligned}$$
(3)

where \(h^{(n)}_{i,j}\) is the kernel output with the kernel index n, \(c_{i,j;k,l}\) is the correlation between a source feature \(f^\mathrm {src}_{i,j}\) and a target feature \(f^\mathrm {trg}_{k,l}\), and \(\varPhi ^{(n)}= \{w^{(n)}_{s,t}\}\) is a set of the kernel weights. Note that the kernel weights are indexed by the offset between the source and target features, and are shared across correlations of any feature pair with the same offset. For example, in Fig. 3a, \(w_{0,0}\) is associated with the target feature at (0, 0) because the source location is (0, 0). The same weight \(w_{0,0}\) is associated with the target feature at (0, 1) when the source location is (0, 1) as in Fig. 3b, because the offset between these features is (0, 0). Also note that each kernel output \(h^{(n)}_{i,j}\) at a location (i, j) captures the displacement of its corresponding source feature \(f^\mathrm {src}_{i,j}\) at the same location.

While a single proposed kernel captures one aspect of feature displacement, a set of the proposed kernels produces a dense feature representation of the displacement of each source feature. We use 128 OAC kernels, resulting in a feature displacement map \(\varvec{h} \in \mathbbm {R}^{128\times 15\times 15}\) encoding the displacement of each source feature. We apply ReLU activations to the kernel outputs, and compute normalized correlations in the OAC kernels since normalization further improves the scores, as observed in [13].

In practice, the proposed OAC kernels are implemented by two sub-procedures. We first compute the normalized correlation map reordered based on the offsets between the locations of the source and target features. In this reordered correlation map, every correlation with the same relative displacement is arranged in the same channel. This reordering results in \((2H-1)(2W-1)\) possible offsets, and thus the size of the output tensor becomes \((2H-1)(2W-1) \times H \times W\), where many of the values are zero because some offsets have no corresponding feature pair. Then, we use a \(1\times 1\) convolutional layer to compute the dense feature representation from the raw displacement information captured in the reordered correlation map. Note that this process significantly reduces the number of channels by compactly encoding various aspects of the local displacements into dense representations.
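
These two sub-procedures can be sketched as follows; this is a deliberately slow but explicit reference implementation of the reordering (the authors' implementation is not reproduced here), and the \(1\times 1\) convolution plays the role of the OAC kernel weights \(w^{(n)}\) in Eqs. (2)-(3). The extra correlation-map normalization of [13] is omitted for brevity.

```python
import torch.nn as nn

def offset_reordered_correlation(f_src, f_trg):
    # Reorder correlations by source-target offset: channel (i - k + H - 1, j - l + W - 1)
    # holds the correlation between source location (i, j) and target location (k, l).
    # Output: (B, (2H-1)*(2W-1), H, W); offsets with no existing feature pair remain zero.
    B, D, H, W = f_src.shape
    corr = correlation_map(f_src, f_trg).view(B, H, W, H, W)   # corr[b, k, l, i, j]
    out = f_src.new_zeros(B, 2 * H - 1, 2 * W - 1, H, W)
    for k in range(H):
        for l in range(W):
            for i in range(H):
                for j in range(W):
                    out[:, i - k + H - 1, j - l + W - 1, i, j] = corr[:, k, l, i, j]
    return out.view(B, (2 * H - 1) * (2 * W - 1), H, W)

# 128 OAC kernels as a 1x1 convolution over the reordered map, followed by ReLU
# (the dense form of Eqs. (2)-(3)).
oac_kernels = nn.Sequential(nn.Conv2d(29 * 29, 128, kernel_size=1), nn.ReLU(inplace=True))
h = oac_kernels(offset_reordered_correlation(f_src, f_trg))    # (B, 128, 15, 15)
```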

Encoding Local Transformation Features. Since the feature displacement map conveys the movement of each source feature independently, each feature alone is not sufficient to predict the global transformation parameters. To allow the network to predict the global transformation from local features in the attention process, we construct a local transformation feature map by combining spatially adjacent feature displacement information captured by \(\varvec{h}\). That is, the proposed network feeds the feature displacement map \(\varvec{h}\) to a \(7\times 7\) convolution layer with 128 output channels applied without padding. This convolution layer results in a local transformation feature map \(\mathcal {F}\in \mathbbm {R}^{128\times 9\times 9}\). Note that each feature \(t_{i,j} \in \mathcal {F}\) captures transformations occurring in a local region. We utilize this local transformation feature map to predict the global transformation through an attention process.
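
Continuing the sketch above, this step is a padding-free \(7\times 7\) convolution that reduces the \(15\times 15\) displacement map to a \(9\times 9\) local transformation feature map; the ReLU after this layer is our assumption, as the activation is not specified.

```python
# Local transformation encoder: 7x7 convolution without padding,
# (B, 128, 15, 15) -> (B, 128, 9, 9). ReLU here is an assumption.
local_tf_encoder = nn.Conv2d(128, 128, kernel_size=7, padding=0)
t = torch.relu(local_tf_encoder(h))   # local transformation feature map F, (B, 128, 9, 9)
```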

Fig. 4.

Illustration of the attention process. Noisy features in the local transformation feature map are filtered out by assigning lower probabilities to their locations. Arrows in the boxes of the local transformation feature map depict features encoding local transformations, and grayscale colors in the attention distribution represent the magnitudes of the probabilities, where brighter colors mean higher probabilities.

3.3 Attentive Global Transformation Estimator

After local transformation encoding, a set of global transformation parameters is estimated with an attention process. Given a local transformation feature map \(\mathcal {F}\in \mathbbm {R}^{\hat{D}\times \hat{H}\times \hat{W}}\) extracted by OAC kernels with a convolution layer, the network focuses on reliable local transformation features by filtering out distracting regions as depicted in Fig. 4 to predict the parameters from the aggregation of those features. Although a feature map \(\mathcal {F}\) gives sufficient information to predict the global transformation from source to target, local transformation features extracted from a real image pair are often noisy due to image variations such as background clutter and intra-class variations. Therefore, we propose a model that suppresses unreliable features by the attention process and extracts an attended feature vector that summarizes local transformations from all reliable locations to estimate an accurate global transformation. In other words, the model computes an attended transformation feature \(\tau ^\mathrm {att}\) by

$$\begin{aligned} \tau ^\mathrm {att} = \sum _{i=1}^{\hat{H}}{\sum _{j=1}^{\hat{W}}{\alpha _{i,j}G(t_{i, j})}}, \end{aligned}$$
(4)

where \(G:\mathbbm {R}^{\hat{D}}\rightarrow \mathbbm {R}^{D'}\) is a projection function mapping \(t_{i,j}\) into a \(D'\)-dimensional vector space and \(\varvec{\alpha } = \{\alpha _{i,j}\}\) is an attention probability distribution over the feature map. The model computes the attention probabilities by

$$\begin{aligned} \alpha _{i,j} = \frac{\exp \left( {S(t_{i,j})}\right) }{\sum _{k=1}^{\hat{H}}{\sum _{l=1}^{\hat{W}}{\exp \left( {S(t_{k,l})}\right) }}}, \end{aligned}$$
(5)

where \(S:\mathbbm {R}^{\hat{D}}\rightarrow \mathbbm {R}\) is an attention score function producing a single scalar given a local transformation feature. Note that the model learns to suppress noisy features by assigning low attention scores and reducing their contribution to the attended feature.

Once the attended feature \(\tau ^\mathrm {att}\) over all local transformations is obtained, we compute the global transformation \(\theta \in \mathbbm {R}^Q\) by a simple matrix-vector multiplication as

$$\begin{aligned} \theta = W\tau ^\mathrm {att}, \end{aligned}$$
(6)

where \(W \in \mathbbm {R}^{Q\times {D'}}\) is a weight matrix for linear projection of the attended feature \(\tau ^\mathrm {att}\).

In summary, we first compute local transformation between two images and perform a nonlinear embedding using a projection function \(G(\cdot )\). The embedded vector is weighted by spatial attention to compute an attended feature \(\tau ^\mathrm {att}\) as shown in Eq. (4). The global transformation vector is obtained by linear projection of the attended feature, which is parametrized by a matrix as presented in Eq. (6).

We use multi-layer perceptrons (MLPs) for G and S in Eqs. (4) and (5). G is a two-layer MLP with 128-dimensional hidden and output layers with ReLU activations. Since the feature representations produced by G are directly used for the final estimation as a linear mapping in Eq. (6), we additionally concatenate a 5-dimensional index embedding to each feature \(t_{i,j}\in \mathcal {F}\) to better estimate the global transformation from local transformation features. S is another two-layer MLP with 64 hidden ReLU activations, but its output is a scalar without non-linearity since softmax normalization is applied outside S. Note that we do not use the index embedding in S to avoid strong biases of attention toward certain regions. Since G and S are applied to all feature vectors across the spatial dimensions, we implement them by multiple \(1\times 1\) convolutions with batch normalization.
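
A sketch of the attentive estimator implementing Eqs. (4)-(6) with \(1\times 1\) convolutions; the ordering of batch normalization, the construction of the index embedding, and the exact layer sizes are assumptions on our part.

```python
import torch
import torch.nn as nn

class AttentiveEstimator(nn.Module):
    # Attentive global transformation estimator (Eqs. (4)-(6)).
    def __init__(self, d_feat=128, d_embed=5, d_proj=128, num_params=6):
        super().__init__()
        self.G = nn.Sequential(                      # projection G with index embedding appended
            nn.Conv2d(d_feat + d_embed, 128, 1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, d_proj, 1), nn.BatchNorm2d(d_proj), nn.ReLU(inplace=True))
        self.S = nn.Sequential(                      # attention score S: one scalar per location
            nn.Conv2d(d_feat, 64, 1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1))
        self.W = nn.Linear(d_proj, num_params)       # linear projection of Eq. (6); 6 params for affine

    def forward(self, t, index_embedding):
        # t: (B, d_feat, 9, 9); index_embedding: (B, d_embed, 9, 9), construction unspecified here
        g = self.G(torch.cat([t, index_embedding], dim=1))      # (B, d_proj, 9, 9)
        alpha = torch.softmax(self.S(t).flatten(2), dim=-1)     # (B, 1, 81), Eq. (5)
        tau = (g.flatten(2) * alpha).sum(dim=-1)                # attended feature, Eq. (4)
        return self.W(tau)                                      # global transformation theta
```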

4 Experiments

4.1 Experimental Settings

Network Training. We build two of the proposed networks with different parametric global transformations: affine and thin-plate spline (TPS) transformations. To train the network, we adapt the average transformed grid distance loss proposed in [13], which indirectly measures the distance from the predicted transformation parameters \(\theta \) to the ground-truth transformation parameters \(\theta _\mathrm {GT}\). Given \(\theta \) and \(\theta _\mathrm {GT}\), the transformed grid distance \(\mathrm {TGD}(\theta , \theta _\mathrm {GT})\) is obtained by

$$\begin{aligned} \mathrm {TGD}(\theta , \theta _\mathrm {GT}) = \frac{1}{|\mathcal {G}|}\sum _{g \in \mathcal {G}}{d\left( \mathcal {T}_\theta \left( g\right) , \mathcal {T}_{\theta _\mathrm {GT}}\left( g\right) \right) ^2} \end{aligned}$$
(7)

where \(\mathcal {G}\) is a set of points in a regular grid, \(\mathcal {T}_\theta \) is the transformation parameterized by \(\theta \), and \(d(\cdot )\) is a distance measure. We minimize the average TGD over training examples to train the network. Since every operation within the proposed network is differentiable, the network is trainable end-to-end using a gradient-based optimization algorithm; we use ADAM for optimization. Following [13], training image pairs are generated synthetically by applying random transformations to images and using center crops to avoid border artifacts. The synthetic image pairs generated by this process are annotated with the ground-truth transformation parameters \(\theta _\mathrm {GT}\), allowing us to train the network with full supervision. Note, however, that this training scheme can be considered unsupervised since no annotated real dataset is used during training.
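
The loss of Eq. (7) for the affine case can be sketched as follows; the grid resolution and the homogeneous-coordinate warping helper are our own choices, not taken from the paper.

```python
import torch

def affine_warp(theta, grid):
    # theta: (B, 6) affine parameters; grid: (N, 2) points in [-1, 1]^2
    A = theta.view(-1, 2, 3)
    g = torch.cat([grid, torch.ones_like(grid[:, :1])], dim=1)   # homogeneous coordinates, (N, 3)
    return torch.einsum('bij,nj->bni', A, g)                     # warped points, (B, N, 2)

def tgd_loss(theta_pred, theta_gt, grid_size=20):
    # Average squared distance between grid points warped by the predicted and
    # ground-truth affine transformations (Eq. (7)).
    axis = torch.linspace(-1, 1, grid_size)
    grid = torch.stack(torch.meshgrid(axis, axis, indexing='ij'), dim=-1).reshape(-1, 2)
    grid = grid.to(theta_pred)
    diff = affine_warp(theta_pred, grid) - affine_warp(theta_gt, grid)
    return diff.pow(2).sum(dim=-1).mean()
```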

For the synthetic dataset generation, we use PASCAL VOC 2011 [46] and build two variations of training datasets with either affine or TPS transformation, each for its corresponding network. A separate set of PASCAL VOC images is held out to generate synthetic examples for validation, and the best-performing models on the validation set are evaluated.

Evaluation. Two public benchmarks, PF-WILLOW and PF-PASCAL [9], are used for the evaluation. PF-WILLOW consists of about 900 image pairs generated from 100 images of 5 object classes. PF-PASCAL contains 1351 image pairs of 20 object classes. Each image pair in both datasets contains different instances of the same object class such as ducks and motorbikes, e.g., the left two images in Fig. 1. The objects in these datasets exhibit large intra-class variations and significant background clutter, making the task more challenging. The image pairs of both PF-WILLOW and PF-PASCAL are annotated with sparse key points that establish correspondences between the two images. Following the standard evaluation metric of these benchmarks, the probability of correct keypoint (PCK) [47], the goal is to correctly transform the key points in the source image to their corresponding points in the target image. A transformed source key point is considered correct if its distance to its corresponding target key point is less than \(\alpha \cdot \mathrm {max}(h, w)\), where \(\alpha =0.1\), and h and w are the height and width of the object bounding box. Formally, PCK of a proposed model \(\mathcal {M}\) is measured by

$$\begin{aligned} \mathrm {PCK}(\mathcal {M}) = \frac{1}{\sum _{i=1}^{N}{|\mathcal {P}_i|}}\sum _{i=1}^{N}{\sum _{(p_\mathrm {s}, p_\mathrm {t}) \in \mathcal {P}_i}{\mathbbm {1}\left[ d\left( \mathcal {T}_{\theta _i}\left( p_\mathrm {s}\right) , p_\mathrm {t}\right) < \alpha \cdot \mathrm {max}(h, w)\right] }} \end{aligned}$$
(8)

where N is the total number of image pairs, \(\mathcal {P}_i\) is the set of source and target key point pairs \((p_\mathrm {s}, p_\mathrm {t})\) for the \(i^\mathrm{th}\) example, \(\theta _i\) is the predicted transformation, and \(\mathbbm {1}[\cdot ]\) is the indicator function which returns 1 if the expression inside the brackets is true and 0 otherwise.
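
For reference, PCK as defined above can be computed as in the following sketch; the variable names and the callable-transform interface are ours.

```python
import torch

def pck(transforms, keypoint_pairs, bbox_sizes, alpha=0.1):
    # transforms: list of callables, one per image pair, mapping source key points (K, 2)
    #             to predicted target coordinates (K, 2)
    # keypoint_pairs: list of (p_src, p_trg) tensors of shape (K, 2) per image pair
    # bbox_sizes: list of (h, w) object bounding-box sizes in the target images
    correct, total = 0, 0
    for T, (p_src, p_trg), (h, w) in zip(transforms, keypoint_pairs, bbox_sizes):
        dist = torch.linalg.norm(T(p_src) - p_trg, dim=-1)       # distances of transformed key points
        correct += (dist < alpha * max(h, w)).sum().item()
        total += dist.numel()
    return correct / total
```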

We evaluate three different versions of the proposed model as in [13]. The first two versions are the models with the affine and TPS transformations, respectively. The third version sequentially merges these two models: the input image pair is first fed to the network with the affine transformation, and the image pair transformed by its output is then fed to the network with the TPS transformation.

Table 1. Experimental results on PF-WILLOW and PF-PASCAL. PCK is measured with \(\alpha =0.1\). Scores for other models are taken from [9, 13, 14], while scores marked with an asterisk (*) are obtained by reproducing the models with the released official code. The PCK scores marked with a star (\(^\star \)) are measured with the height and width of the image instead of the bounding box. Note that the PCK measure with the bounding box size is more conservative than the one with the image size, resulting in lower scores.

4.2 Results

Comparisons to Other Models. Table 1 shows the comparative results on both the PF-WILLOW and PF-PASCAL benchmarks. It includes (i) previous methods using hand-crafted features: DeepFlow [5], GMK [31], SIFTFlow [7], DSP [11], and ProposalFlow [9], (ii) self-supervised alignment methods: GeoCNN [13] and the proposed attentive alignment network (A2Net), and (iii) supervised methods: UCN [12], FCSS [17], and SCNet [14]. Note that the supervised methods are trained with either weakly or strongly annotated data and that many of their PCKs are measured under a different criterion, so they are not directly comparable to the other scores. By contrast, our method is trained only on synthetic data with self-supervision. As shown in Table 1, the proposed method substantially outperforms all the other methods that are directly comparable. With the VGG-16 feature extractor, the proposed method improves PCK by 12.5% and 5% over the non-attentive alignment method [13] on PF-WILLOW and PF-PASCAL, respectively. This reveals the effect of the proposed attention model for semantic alignment. The quality of the model is further improved when incorporated with a more advanced feature extractor such as ResNet101. It is notable that the proposed model outperforms some of the supervised methods, UCN [12] and FCSS [17], even though it is trained without any real datasets.

Table 2. PCKs of ablations on PF-WILLOW trained with PASCAL VOC 2011. Scores of GeoCNN are obtained from the code released by the authors. The numbers of network parameters exclude the feature extractors since all models share the same feature extractor.
Table 3. PCKs of affine models on PF-WILLOW with different training datasets: PASCAL VOC 2011 and Tokyo Time Machine. Scores for GeoCNN are taken from [13].

Ablation Study. As our proposed model combines two distinct techniques, we perform ablation studies to demonstrate their individual effects. We mainly compare the proposed model to GeoCNN, which directly predicts the global transformation parameters using the correlation layer. To see the effect of the proposed OAC kernels, we build a model, referred to as GeoCNN+OACK, by replacing the correlation layer of GeoCNN with the OAC kernels. As shown in Table 2, the use of the OAC kernels already improves the performance of GeoCNN for all three versions. Moreover, the OAC kernels reduce the number of parameters in the network since they use dense representations of local transformations, allowing channel compression. Applying the attention process on top of the correlation layer (GeoCNN+Attention) drops the performance. This is because the correlation map does not encode local transformations in a translation-invariant representation. On the other hand, the attention process with the OAC kernels, i.e., the proposed model, further improves the performance because distracting regions can be suppressed during transformation estimation thanks to the local transformation feature map obtained by the OAC kernels. It is also notable that applying the attention process reduces the number of model parameters because the model does not need extra layers that combine all local information to produce the global estimate; instead, it simply aggregates local features with the attention distribution. This additional parameter reduction results in 70% fewer parameters than GeoCNN while maintaining superior performance.

Sensitivity to Training Datasets. While both our model and GeoCNN are generally applicable to any image dataset, we examine the sensitivity of the models to the choice of training dataset. We train both models with the affine transformation on another image dataset, Tokyo Time Machine [48], using the same synthetic generation process, and measure how much the performance changes depending on the dataset. Table 3 shows that the proposed model is less dependent on the choice of the training dataset than GeoCNN.

Fig. 5.

Qualitative results of the attentive semantic alignment. Each row shows an example from the PF-PASCAL benchmark. Given the source and target images shown in the first and third columns, we visualize the attention map of the affine model (second column), the image transformed by the affine model (fourth column), and the final image transformed by the affine+TPS model (last column). Since the models learn the inverse transformation, the target image is transformed toward the source image while the attention distribution is drawn over the source image. The model attends to the objects to match and estimates dense correspondences despite intra-class variations and background clutter.

Qualitative Results with Attention Visualizations. Figure 5 presents qualitative examples of our model on PF-PASCAL. In our experimental setting, the models learn to predict the inverse transformation. Therefore, we transform the target image toward the source image using the estimated inverse transformation, whereas the attention distribution is drawn over the source image. The proposed model attends to the target objects while suppressing other regions and predicts the global transformation based on reliable local features. The model estimates the transformation despite large intra-class variations such as an adult vs. a kid.

We also investigate some failure cases of the proposed model in Fig. 6. The model is confused when there are multiple objects of the same class in an image or when a large obstacle occludes the matching objects. Also, objects in some examples are hard to recognize visually, which leads to mismatches. For instance, the model fails to correctly match a wooden chair to a transparent chair in the second example of Fig. 6, although it attends to the correct region. It is challenging even for humans to recognize the transparent chair and its corresponding key points.

Fig. 6.

Some failure cases of the proposed model with the affine transformation. Each row shows an example from PF-PASCAL. Each example contains (1) the source image, (2) the source image masked by the attention distribution, (3) the target image, and (4) the target image transformed by the predicted affine parameters. Even though the model attends to the matching objects, it fails to find the correct correspondences due to ambiguity caused by multiple objects of the same class, or due to hard examples that are difficult to perceive visually.

5 Conclusion

We propose a novel approach for semantic alignment. Our model incorporates an attention process that estimates the global transformation from reliable local transformation features by suppressing distracting features. We also propose offset-aware correlation kernels that reorder correlations of feature pairs and produce a dense feature representation of local transformations. The experimental results show that the attentive model with the proposed kernels achieves state-of-the-art performance by large margins over previous methods on the PF-WILLOW and PF-PASCAL benchmarks.