1 Introduction

Neural ranking models address the ad hoc retrieval problem by modeling the semantic matching between the query and document with neural networks. Recently, BERT-based ranking models have learned latent knowledge for document ranking from large-scale text collections. Taking the concatenation of the query and document as input, BERT models their word-level interactions through the self-attention matrix. In this sense, BERT-based ranking models belong to the interaction-based neural ranking models, which naturally fit the ad hoc retrieval task.

However, interaction-based models focus solely on defining the interaction function between the query and document [11]. BERT’s self-attention matrix is such an interaction function: it models all possible kinds of word relations that may be useful for the matching process, such as query–document and document–document word relations. Traditional interaction functions only consider query–document word relations, whereas BERT also takes query–query and document–document word relations into consideration. Whether these additional relations benefit relevance prediction remains unknown.

Moreover, not all document words are related to the query. The document representation derived from these words is thus composed of query related and query unrelated parts. The relevance score of a document to a query is usually determined by the query related part of the document representation rather than the unrelated part [11]. However, it is hard to tell which part is related to the query and which is not, although disentangling the related part from the document representation is important for deriving the final relevance score.

Fig. 1 Illustration of spurious word relations and latent representations in BERT-based ranking model

We probe into how spurious word relations and latent representations affect the retrieval performance. BERT’s self-attention matrix provides a complete word relation knowledge base like B in Fig. 1. (1) For the relevance prediction of a query–document pair, spurious word relations exist in this base. For the query “apple song kids” and a document titled “50 Classic Kid’s Songs on Apple Music” in Fig. 1, relations between “apple” and the other document words mislead the model into predicting the actually irrelevant document as relevant. (2) Document representations derived from B contain some unrelated information. Irrelevant words, such as “classic” and “Apple”, confuse the downstream classifier and again lead it to predict the actually irrelevant document as relevant. Therefore, it is necessary to remove the effect of spurious word relations and of unrelated information in the latent representations.

Taking both the spurious word relations and the unrelated representations as confounding factors, we depict the causal graph in the right part of Fig. 1. For each query–document pair X and its relevance label Y, these confounding factors correlate X and Y even when there is no direct causation between them. The self-attention matrix B is generated from the input query–document embedding X and measures word similarity between the query and document. Z denotes the latent representations learned by BERT. For the observed confounding factor, i.e. spurious word relations in the self-attention matrix B, we block the unreasonable path, i.e. the back-door path \(X\rightarrow B\rightarrow Z\rightarrow Y\), which has an effect on Y. For the unobserved confounding factor, i.e. unrelated representations in Z, we reduce its effect on Y by resolving the front-door path \(X\rightarrow Z\rightarrow Y\).

To reduce the effects of the confounding factors in the causal graph of Fig. 1, we propose a Disentangled Graph Recurrent neural network, referred to as DGRe, which decouples the word representations learned from BERT for document ranking. Specifically, we first design a causal graph for the document ranking task and cast the problem in a causal inference framework. Then an adaptive masking method is proposed to alleviate the observed confounding effect through the transformer layer. After the word refinement layer, a mutual information decomposition layer is introduced to disentangle the document representation into query related and unrelated parts, addressing the unobserved confounding effect on representations.

For each query–document pair X, DGRe first takes their concatenation as input and obtains word representations through the transformer layer, from which a latent graph is derived as a self-attention matrix. Next, an adaptive masking method is proposed to disentangle word relations in this latent graph with the sharp activation function ReLU, which keeps relations with higher attention weights and removes those with lower weights. Then, in the word representation refinement layer, word representations are updated with a gated recurrent unit over this disentangled graph to achieve the back-door adjustment \(X\rightarrow B\rightarrow Z\) and deal with the observable confounder in B. Afterward, for the unobservable confounder in Z, we realize the do-calculus of Z in the front-door adjustment \(X\rightarrow Z\rightarrow Y\) through the mutual information decomposition layer, which decomposes the derived document representation into query related and unrelated parts according to the query’s attention weights.

All the representations derived from the BERT layer, the word representation refinement layer and the mutual information decomposition layer are aggregated through multi-layer perceptrons and classified with a sigmoid function. A pairwise ranking loss is defined over the relevance scores. Moreover, a triangle distance loss is proposed as a function of the query, document and query–document pair representations to learn discriminative representations. Finally, a mutual information regularization is proposed to minimize the mutual information between the two decomposed parts. All loss functions are optimized jointly in an end-to-end manner. Experiments on the public benchmark datasets Robust04 and WebTrack2009-12 demonstrate the effectiveness of DGRe. Detailed design choices, such as the effect of additional word relations on query–document matching, are further analyzed in the experiments.

To sum up, our major contributions lie in the following aspects.

(1) A causal graph is designed for BERT-based ranking models to disentangle the intrinsic reason for the relevance between the query and document from the confounding factors.

(2) To reduce the observable confounding effect on word relations, an adaptive masking method is proposed to identify useful word relations in the learned self-attention matrix, and word representation refinement is performed over this disentangled word graph for each query–document pair.

(3) To reduce the unobservable confounding effect on the document word latent representations, a mutual information decomposition layer is introduced to decouple the document representation into two parts, i.e. query related and unrelated representations.

(4) Besides the pairwise ranking loss for the basic document ranking task, a triangle distance loss on the transformer layer learns discriminative representations for the downstream ranking task, and a mutual information regularization on the decomposition layer disentangles the document representation.

2 Related Work

Here we briefly review related studies on interaction-based neural ranking models, BERT-based ranking models, causal inference, and other related techniques such as mutual information and graph neural networks.

2.1 Interaction-Based Neural Ranking Models

Interaction-based neural ranking models assume that relevance is in essence about the relation between input texts, and it is more effective to learn from interactions rather than individual representations. They focus on designing the interaction function to produce the relevance score. Existing interaction functions are divided into two kinds: non-parametric and parametric interaction functions [11].

Traditional non-parametric interaction functions include the binary indicator, cosine similarity, dot product, radial basis function and so on. DRMM [10] converts the local interaction matrix of query–document word pairs into a fixed-length matching histogram for relevance matching. MatchPyramid [22] builds a word-level similarity matrix and applies convolutional layers over it as a parametric interaction function.

2.2 BERT-Based Ranking Models

Early BERT-based ranking models incorporate BERT’s classification vector into existing neural models, such as DRMM [10] and Conv-KNRM [8]. PARADE [19] leverages passage-level representations to predict a document’s relevance score without the passage independence assumption, and further improves its performance by fine-tuning on the MSMARCO passage ranking dataset instead of the Bing search log. Other research focuses on improving the efficiency of pre-trained neural language models in retrieval tasks. PreTTR [23] precomputes part of the document term representations at indexing time and merges them with the query representation at query time to compute the final ranking score. DeepCT [7] maps the contextualized term representations from BERT into context-aware term weights for efficient passage retrieval.

Existing BERT-based ranking models focus on how to design the input of the BERT layer and how to exploit its output for the document ranking task. In this paper, we explore the underlying reasons inside the BERT layer for a document’s relevance score to a query, and remove possible confounding factors in the BERT layer when deriving the relevance score.

2.3 Causal Inference in Neural Network

Causal inference [26] provides researchers with a new methodology for designing more robust models. Some studies focus on generating counterfactual samples from the perspective of causal inference to improve model performance [1, 16, 30]. Other studies explore how to remove biases in datasets [31, 35,36,37]. These studies usually assume that the confounder is observable [31, 36] or can be characterized with domain-specific knowledge [3, 13].

We design the causal graph deep inside the self-attention structure, which is similar to causal attention [35] in computer vision. Different from causal attention [35], which alleviates dataset bias, our causal graph is designed to remove the confounding effects of spurious information on the document ranking task. Causal attention [35] is implemented as sampling techniques within one training sample or across several samples, whereas DGRe reduces the confounding effects with the adaptive masking method and the mutual information decomposition layer.

2.4 Other Related Techniques

Mutual information-based methods have a long history, especially in unsupervised representation learning. Benefiting from the increasing attention to mutual information estimation [2, 4, 5, 20], we can efficiently estimate the mutual information of two latent variables with a neural network. DIM [14] introduces a new representation learning loss by maximizing mutual information in an unsupervised way. GMI [28] brings mutual information into graph representation learning to alleviate the lack of available supervision and avoid potential risks from unreliable labels. In addition, SSD [12] is a disentanglement framework where mutual information serves as a supervision signal for domain adaptation tasks. DGRe uses mutual information constraints similar to [12], but applies them to disentangling the document representation for the document ranking task.

Graph neural networks (GNN) have been widely studied in many fields because of their ability to capture high-order relations. The information propagation step is key to obtaining the hidden states of nodes (or edges) in a GNN. According to the information propagation method, GNNs can be divided into convolution-based, attention-based, recursive-based models, and so on [38]. Convolution-based GNNs, which extend the convolution operation to the graph domain, include spectral and spatial approaches. Through the attention mechanism, attention-based GNNs focus on important nodes in the graph and on the important information of these nodes to improve the signal-to-noise ratio of the original data [32]. Recursive-based GNNs use gate mechanisms such as the GRU [18] in the propagation step to improve the long-term propagation of information across the graph structure. Here we explore a combination of the transformer and a recursive-based GNN to refine word representations over a disentangled graph for BERT-based ranking models.

3 Method

To solve the ad hoc document retrieval problem, we first describe the causal inference framework for BERT-based ranking models. Then a network architecture is proposed to perform the causal inference at both the word and document levels. Finally, an additional loss function is introduced to ensure that the document representation is decomposable.

3.1 Problem Formalization

The ad hoc document retrieval task is to produce a ranking of the documents in a corpus given a short query. There are Q queries \(\{q_i\}_{i=1}^Q\) for training. Each query q is represented as a word sequence \(s^q=\text {w}_1^q,\text {w}_2^q,\ldots ,\text {w}_m^q\) and is associated with a document set \(D_q=\{(d_j,y_j)\}_{j=1}^{n_q}\), where \(y_j\in \{0,1\}\) is the ground truth relevance label of document \(d_j\). Non-relevant documents from \(D_q\) are denoted as \(D_q^-\) (\(|D_q^-|=n_q^-\)) and relevant documents as \(D_q^+\) (\(|D_q^+|=n_q^+\)). Each document \(d\in D_q\) is denoted as a word sequence \(s^d=\text {w}_1^{d},\text {w}_2^{d},\ldots ,\text {w}_n^{d}\). How to model the text matching between the query and document is key to neural ranking models.

3.2 Causal Inference Framework for Document Ranking

We utilize the causal graph [27] to depict the causal effect in the matching process between the query and document. Being intrinsically interaction-based, BERT-based ranking models usually take the concatenation of a query q and a document d as input, i.e. \(X=(q,d)\). From the perspective of the matching process, there is redundant information at both the word and document levels, which may lead to a spurious correlation between X and Y. One source lies in the self-attention matrix B generated from X, which provides some harmful word relations for the matching process. For example, in Fig. 1 the document word relation between “apple” and “song” hinders the model from predicting this document to be irrelevant to the query. The other is that not all words in a document d are related to a query q, regardless of the ground truth relevance label of (q, d). When humans judge whether q and d are relevant, the decision is usually determined by the document’s query related part instead of the query unrelated part [11].

Fig. 2 Causal graph for BERT-based ranking models

To emphasize the common cause of X and Y, we extend the causal graph in Fig. 1 and derive the graph in Fig. 2 to describe the two confounding factors mentioned above. Based on this causal graph, the document ranking task is to answer the do-operation query P(Y|do(X)), where Y is the binary relevance label, i.e. relevant or not. In practice, the probability is usually computed through a sigmoid layer. For simplicity, we suppose \(P(Y|do(X))\propto \exp (g(\cdot ))\). Similarly, the other probabilities here are also assumed to be computed with a softmax/sigmoid layer, in proportion to the exponential form \(\exp (\cdot )\).

To remove the query unrelated part of the document representation, we block the front-door path \(X\rightarrow Z\rightarrow Y\) by the unobservable confounder Z. According to the front-door path adjustment [27], we deconfound the factor Z by Eq. (1). Different from traditional front-door adjustment [35], we resolve the do-calculus of Z by decomposing it into query related and unrelated parts, i.e. \(Z_r\) and \(Z_n\). To calculate the expectation \(\mathbb {E}_Z[Z]\) in Eq. (4), we introduce a mutual information decomposition layer in Fig. 3 to split the document representation into two independent parts.

$$\begin{aligned} P(Y|do(X))&=\sum _{z}P(Y|do(Z))P(Z|do(X))\\&=\sum _{Z_j\in \{Z_r,Z_n\}}P(Y|Z_j)P(Z_j|do(X))\\ \end{aligned}$$
(1)

To eliminate harmful word relations from the self-attention matrix, we block the back-door path \(X\leftarrow B\rightarrow Y\) through the confounding factor B. Specifically, we estimate the do-operation query P(Z|do(X)) in Fig. 2 by Eq. (2) to keep useful word relations in the self-attention matrix B. Suppose word relations with positive similarity in B, denoted as \(B_+\), have a positive effect on the performance. The do-calculus of X is then resolved by disentangling useful word relations from spurious ones in B. To estimate the expectation \(\mathbb {E}_{B}\left[ X\right]\) in Eq. (4), we design an adaptive masking method to obtain the disentangled graph and perform message passing over this disentangled word graph to refine word representations in Fig. 3.

$$\begin{aligned} P(Z_j|do(X))=\sum _{B_i\in \{B_+,B_-\}}P(Z_j|X,B_i)P(B_i) \end{aligned}$$
(2)

Replacing \(P(Z_j|do(X))\) in Eq. (1) with Eq. (2), the prediction function P(Y|do(X)) is obtained as Eq. (3). The expectation of an exponential function can be approximated by its weighted geometric mean [29, 33, 35]. So Eq. (3) is approximated by the weighted geometric mean of P(Y|X,Z), which can be further approximated by exchanging the order of the exponential and expectation operators, as in Eq. (4). A sigmoid layer is then used for normalization to derive the probability P(Y|do(X)).

$$\begin{aligned} P(Y|do(X))&= \mathbb {E}_{Z}\mathbb {E}_B\left[ P(Y|X,Z)\right] \end{aligned}$$
(3)
$$\begin{aligned}&\propto \exp \left( {g(\mathbb {E}_{Z}[Z ],\mathbb {E}_{B}[X])}\right) \end{aligned}$$
(4)
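To make the step from Eq. (3) to Eq. (4) explicit, the following sketch spells out the chain of approximations, assuming \(P(Y|X,Z)\propto \exp (g(Z,X))\) as above; the last step additionally assumes g can be moved outside the expectation, as in [29, 33, 35].

$$\begin{aligned} \mathbb {E}_{Z}\mathbb {E}_{B}\left[ P(Y|X,Z)\right]&\approx {\text {WGM}}_{Z,B}\left[ \exp (g(Z,X))\right] =\prod _{Z,B}\exp \left( g(Z,X)\right) ^{P(Z|do(X))P(B)}\\&=\exp \left( \mathbb {E}_{Z,B}\left[ g(Z,X)\right] \right) \approx \exp \left( g(\mathbb {E}_{Z}[Z],\mathbb {E}_{B}[X])\right) \end{aligned}$$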
Fig. 3 Disentangled graph recurrent network architecture

3.3 Architecture

Given a query–document pair \(X=(q,d)\), the self-attention mechanism in the transformer layer of Fig. 3 provides a natural way to model their interaction and, at the same time, introduces the confounding factors for predicting its label P(Y|X). We introduce the causal graph to reduce their negative effect on the performance and resolve the do-query P(Y|do(X)) on this graph by both back-door and front-door adjustments, arriving at the do-free form \(\exp (g(\mathbb {E}_{B}[X],\mathbb {E}_Z[Z]))\), which can be implemented as neural network layers. Next, the disentangled word representations \(\mathbb {E}_B[X]\) are refined over the word graph generated from the transformer layer under the supervision of the adaptive masking method in Fig. 3. Then, we further decompose the document word representations into two parts with the query attention mechanism and derive the query related document word representations to approximate \(\mathbb {E}_Z[Z]\). Finally, multi-layer perceptrons (MLP) aggregate all these word representations, followed by a sigmoid layer to predict the relevance probability of (q, d), as shown in the rightmost part of Fig. 3.

Fig. 4 Bipartite word graphs constructed from two strategies. Blue, green and gray represent the word attention scores between query–query, document–document and query–document words, respectively; white means no word relation

3.3.1 Transformer Layer

For each query–document pair (q, d), the two word sequences are concatenated, i.e. \(X^{(q,d)}=[[\text {CLS}],s^q,[\text {SEP}],s^d,[\text {SEP}]]\). Its input embedding \(\mathbf {I}^{(q,d)}\) is the sum of the word embeddings and the corresponding position embeddings of \(X^{(q,d)}\). Then \(\mathbf {I}^{(q,d)}\) is fed into BERT, which stacks L identical layers (e.g. \(L=12\) in BERT-base). For each word i at each layer \(l=1,\ldots ,L\), its word representation \(\mathbf {E}_{l}^{(q,d)}(i)\in \mathbb {R}^{d_k}\) is obtained by a weighted sum over the other word representations in Eq. (6), where \(d_k\) is the dimension of the word representations.

$$\begin{aligned} \mathcal {A}_{l-1}^{(q,d)}= & {} \text {softmax}\left( \frac{(\mathbf {W}_B\mathbf {E}_{l-1}^{(q,d)})(\mathbf {W}_B\mathbf {E}_{l-1}^{(q,d)})'}{\sqrt{d_k}}\right) \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {E}_{l}^{(q,d)}(i)= & {} \mathbf {E}_{l-1}^{(q,d)}(i)+\sum _{j}\mathcal {A}_{l-1}^{(q,d)}(i,j)\mathbf {E}_{l-1}^{(q,d)}(j) \end{aligned}$$
(6)

where \(\mathcal {A}_{l-1}^{(q,d)}\) is the attention matrix learned in the \(l-1\)-th layer and \(\mathbf {E}_{0}^{(q,d)}=\mathbf {I}^{(q,d)}\). Through this layer, we obtain L attention matrices \(B=\{\mathcal {A}_l^{(q,d)}\}_{l=1}^L\) for each query–document pair (qd). Each attention matrix \(\mathcal {A}_{l}^{(q,d)}\) naturally models the query–document word interaction.
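As a concrete illustration, the following is a minimal single-head PyTorch sketch of Eqs. (5)–(6); the shared projection \(\mathbf {W}_B\) and the toy shapes are simplifying assumptions, since BERT actually uses multi-head attention with separate query/key/value projections, layer normalization and feed-forward sublayers.

```python
import torch

def transformer_layer(E_prev, W_B):
    """Single-head sketch of Eqs. (5)-(6).

    E_prev: (seq_len, d_k) word representations from layer l-1.
    W_B:    (d_k, d_k) shared projection (an assumption; BERT uses multi-head Q/K/V).
    Returns the attention matrix A_{l-1} and the updated representations E_l.
    """
    d_k = E_prev.size(-1)
    proj = E_prev @ W_B.T                                    # (seq_len, d_k)
    A = torch.softmax(proj @ proj.T / d_k ** 0.5, dim=-1)    # Eq. (5)
    E_next = E_prev + A @ E_prev                             # Eq. (6): residual + weighted sum
    return A, E_next

# toy usage: a 5-token "[CLS] query [SEP] document [SEP]" sequence with d_k = 8
E0 = torch.randn(5, 8)
W_B = torch.randn(8, 8)
A0, E1 = transformer_layer(E0, W_B)
```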

3.3.2 Word Representation Refinement over Disentangled Graph

To explore the confounding factor in \(\mathcal {A}_l^{(q,d)}\in B\), we note that word relations in each self-attention matrix \(\mathcal {A}_l^{(q,d)}\) fall into three categories: document–document, query–query and query–document word relations. Intuitively, only query–document word relations are useful for query–document matching, and the other interactions may harm the retrieval performance [6]. A simple method is to mask these relations and keep only a bipartite word graph. However, this is not flexible across query–document pairs, since not all document words are useful for the matching. Thus, we propose an adaptive masking method to separate useful word relations from spurious ones, and we perform message passing over the resulting disentangled word graph to remove the negative effect of spurious relations on word representations.

The heuristic masking method is visualized in Fig. 4a. Word relations within a query and within a document are removed; white means there are no edges between the corresponding nodes. The upper-triangle part of the masking matrix \(\mathcal {M}_l^{(q,d)}\) is obtained from Eq. (7) for each transformer layer l, and the lower-triangle part is filled according to the symmetry of \(\mathcal {M}_l^{(q,d)}\). Based on this simple masking method, the masked self-attention-like matrix \(G_l^{(q,d)}\) is defined as Eq. (8), where \(\epsilon\) is small enough (a large negative constant) so that the masked entries are removed by the ReLU. The ReLU also filters out all possible spurious relations whose word similarity is smaller than 0. This is referred to as adaptive masking.

$$\begin{aligned}&\mathcal {M}_l^{(q,d)}(i,j)={\left\{ \begin{array}{ll} 1 &{} 1\le i\le m, m+2\le j\le m+n+2\\ 1 &{} i=j\\ 0 &{}\text {otherwise}\\ \end{array}\right. } \end{aligned}$$
(7)
$$\begin{aligned}&G_l^{(q,d)}={\text {ReLU}}\left( \frac{(\mathbf {W}_A\mathbf {E}_l^{(q,d)})(\mathbf {W}_A\mathbf {E}_l^{(q,d)})'}{\sqrt{d_k}}+\epsilon (1-\mathcal {M}_l^{(q,d)})\right) \end{aligned}$$
(8)

To derive the rigorous masked self-attention matrix, we first normalize each element of the self-attention-like matrix \(G_l^{(q,d)}\) by its infinity norm to avoid overflow. Then a modified softmax function \({\text {softmax}}_m (x)\) for a vector \(x\in \mathbb {R}^{1\times n_x}\) in Eq. (9) is introduced to obtain the probability distribution over all other words, where a zero entry receives zero probability; in other words, negative word relations are completely filtered out. The masked self-attention matrix produced by the adaptive masking method, namely the disentangled word graph \(\hat{\mathcal {A}}_l^{(q,d)}\) for transformer layer l, is defined as Eq. (10).

$$\begin{aligned} {\text {softmax}}_m (x) = \left( \frac{\exp (x_i)-1}{\sum _j (\exp (x_j)-1)}\right) _{1\times n_x} \end{aligned}$$
(9)
$$\begin{aligned} \hat{\mathcal {A}}_l^{(q,d)}={\text {softmax}}_m \left( \frac{G_l^{(q,d)}}{\Vert G_l^{(q,d)}\Vert _{\infty }}\right) \end{aligned}$$
(10)

With the adaptive masking method, the disentangled graph \(\hat{\mathcal {A}}_l^{(q,d)}\) is derived. To distill the useful word representations from all the word representations, we perform message passing over this disentangled graph; this process is called word representation refinement. We use a gated graph neural network (GGNN) [18] to update word representations over the bipartite-core graph \(\hat{\mathcal {A}}_l^{(q,d)}\). At each propagation step t, GGNN aggregates neighbor word representations for each word in the graph and concatenates the word representations from the last iteration with the neighborhood aggregation of the current iteration as the input of the gated recurrent unit (GRU) in Eq. (11). This helps exploit high-order word relations to obtain fine-grained representations.

$$\begin{aligned} \mathbf {h}_0^l&= \mathbf {E}_l^{(q,d)}\nonumber \\ \mathbf {h}_t^l&={\text {GRU}}([\mathbf {h}_{t-1}^l,\hat{\mathcal {A}}_l^{(q,d)}\mathbf {h}_{t-1}^l]) \end{aligned}$$
(11)

After T propagation steps, the refined word representations \(\mathbf {h}_T^{l}\) are obtained for each query–document pair at transformer layer l. A self-attention mechanism is applied again in Eq. (12) to \(\mathbf {h}_T^l\) to derive the graph-level representation \(\mathbf {Z}^{l}\). Since the softmax in Eq. (12) approximates a probability distribution, the expectation \(\mathbb {E}_B[X]\) is proportional to \(\mathbf {Z}^l\).

$$\begin{aligned} \mathbb {E}_{B}[X]\propto \mathbf {Z}^{l}={\text {softmax}}((\mathbf {W}_a\mathbf {h}_T^l) \cdot (\mathbf {W}_{h}\mathbf {h}_T^l)')\cdot \mathbf {h}_T^l \end{aligned}$$
(12)
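The refinement step of Eqs. (11)–(12) can be sketched as follows in PyTorch; the number of propagation steps and the use of a single shared GRU cell per layer are illustrative choices rather than the exact configuration.

```python
import torch
import torch.nn as nn

class WordRefinement(nn.Module):
    """Sketch of Eqs. (11)-(12): GRU-based message passing over the disentangled graph,
    followed by a self-attention readout approximating E_B[X]."""

    def __init__(self, d_k, steps=2):
        super().__init__()
        self.steps = steps
        # GRU input is [h_{t-1}, A_hat h_{t-1}], i.e. twice the hidden size.
        self.gru = nn.GRUCell(input_size=2 * d_k, hidden_size=d_k)
        self.W_a = nn.Linear(d_k, d_k, bias=False)
        self.W_h = nn.Linear(d_k, d_k, bias=False)

    def forward(self, E_l, A_hat):
        h = E_l                                           # h_0 = E_l, (seq_len, d_k)
        for _ in range(self.steps):                       # Eq. (11)
            msg = A_hat @ h                               # neighborhood aggregation
            h = self.gru(torch.cat([h, msg], dim=-1), h)
        attn = torch.softmax(self.W_a(h) @ self.W_h(h).T, dim=-1)
        return attn @ h                                   # Eq. (12): Z^l, proportional to E_B[X]
```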

3.3.3 Mutual Information Decomposition Layer

From the perspective of the query–document interaction, the word representation refinement layer eliminates spurious query–document word relations through the disentangled word graph. From the perspective of the document representation, not all document words are necessary for the query–document matching process. Naturally, the importance of a document word depends on how the query representation attends to it. We therefore introduce a conventional attention mechanism to put more weight on document words in terms of the query representation. According to this attention mechanism, the document word representations are decomposed into a query related part and its complement. To obtain a good decomposition, we add mutual information constraints that minimize the overlapping information between the two parts; these constraints are introduced in the loss function section.

$$\begin{aligned} \mathbf {Z}^l=[\mathbf {Z}^l_{\text {CLS}},\mathbf {Z}_q^l,\mathbf {Z}_{\text {SEP}}^l,\mathbf {Z}^l_d,\mathbf {Z}_{\text {SEP}}^l] \end{aligned}$$
(13)

Through the word representation refinement layer, we obtain all word representations as in Eq. (13). Based on the query word representations \(\mathbf {Z}_q^l\), a sigmoid function decides the probability that a document word is important for the current query word. With this probability, we split the document word representations \(\mathbf {Z}_d^l\) into a query related part \(\mathbf {Z}_{d_r}^l\) (Eq. (14)) and a query unrelated part \(\mathbf {Z}_{d_n}^l\) (Eq. (15)). For simplicity, we assume that only the query related part \(\mathbf {Z}_{d_r}^l\) has an effect on the retrieval performance, so the target expectation \(\mathbb {E}_Z[Z]\) is calculated as \(1\times \mathbf {Z}_{d_r}^l+0\times \mathbf {Z}_{d_n}^l=\mathbf {Z}_{d_r}^l\).

$$\begin{aligned} \mathbf {Z}_{d_r}^l&= \sigma ((\mathbf {W}_q \mathbf {Z}_q^l)\cdot (\mathbf {W}_d \mathbf {Z}_d^l)')\cdot \mathbf {Z}_d^l \end{aligned}$$
(14)
$$\begin{aligned} \mathbf {Z}_{d_n}^l&= (1-\sigma ((\mathbf {W}_q \mathbf {Z}_q^l)\cdot (\mathbf {W}_d \mathbf {Z}_d^l)'))\cdot \mathbf {Z}_d^l \end{aligned}$$
(15)
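A small sketch of the decomposition in Eqs. (14)–(15); following the equations literally, the gate is an \(m\times n\) matrix of relatedness probabilities, so each query word receives one aggregated related and one unrelated document vector.

```python
import torch

def decompose_document(Z_q, Z_d, W_q, W_d):
    """Sketch of Eqs. (14)-(15): split document word representations into
    query related and unrelated parts with a sigmoid attention gate.

    Z_q: (m, d_k) query word representations.
    Z_d: (n, d_k) document word representations.
    W_q, W_d: (d_k, d_k) projection matrices.
    """
    gate = torch.sigmoid((Z_q @ W_q.T) @ (Z_d @ W_d.T).T)   # (m, n) relatedness probabilities
    Z_dr = gate @ Z_d                                        # Eq. (14): query related part
    Z_dn = (1.0 - gate) @ Z_d                                # Eq. (15): query unrelated part
    return Z_dr, Z_dn
```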

3.3.4 Prediction Layer

We add skip connections from the BERT layer, the word refinement layer and the mutual information decomposition layer to the prediction layer to avoid unnecessary information loss. The \(g(\cdot )\) of the final do-operation query P(Y|do(X)) in Eq. (4) is estimated as a linear combination of the aggregated information above in Eq. (16). Then a sigmoid function is employed to estimate P(Y|do(X)) in Eq. (17).

$$\begin{aligned}&g(q,d)=\mathbf {w}_f(\mathbf {W}_s[\mathbf {Z}^{l},\mathbf {Z}_{d_r}^l,\mathbf {E}_l^{(q,d)}(0)]+\mathbf {b}_{s})_{1\times L}+b_{f} \end{aligned}$$
(16)
$$\begin{aligned}&P(Y|do(X))\approx f(q,d)=\sigma \left( g(q,d)\right) \end{aligned}$$
(17)
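A sketch of the prediction layer in Eqs. (16)–(17); how the per-layer matrices \(\mathbf {Z}^{l}\) and \(\mathbf {Z}_{d_r}^l\) are pooled into single vectors is not fixed here and is treated as an assumption.

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Sketch of Eqs. (16)-(17): aggregate skip-connected representations from the
    BERT, refinement and decomposition layers, then score with a sigmoid."""

    def __init__(self, d_k, num_layers):
        super().__init__()
        self.W_s = nn.Linear(3 * d_k, 1)      # W_s, b_s: one scalar per transformer layer
        self.w_f = nn.Linear(num_layers, 1)   # w_f, b_f: combine the L layer-wise scalars

    def forward(self, Z, Z_dr, cls):
        # Z, Z_dr, cls: lists of length L, each entry a pooled (d_k,) vector per layer
        # (the pooling is an assumption of this sketch).
        feats = torch.stack([
            self.W_s(torch.cat([z, zr, c], dim=-1)).squeeze(-1)
            for z, zr, c in zip(Z, Z_dr, cls)
        ])                                     # Eq. (16): (L,) vector of layer scores
        return torch.sigmoid(self.w_f(feats))  # Eq. (17): P(Y|do(X)) approximated by f(q, d)
```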

3.4 Loss Function

To obtain the optimal model parameters, we add the triangle distance loss, the decomposition loss and the pairwise ranking loss to the transformer layer, the decomposition layer and the prediction layer, respectively.

Fig. 5 Illustration of different constraints’ effect on learned query/document representations

3.4.1 Triangle Distance

From the embedding perspective, we propose a triangle distance loss that places constraints on the query, document and query–document representations. The cosine distance [17] was first introduced to separate examples with different labels in classification problems. Given two samples a and b with representations \(x_a\) and \(x_b\), respectively, the cosine distance is defined as Eq. (18), where \(\mathbb {I}(a,b)=1\) if a and b have the same label and 0 otherwise.

$$\begin{aligned} s(x_a,x_b)=1+2\mathbb {I}(a,b)\cos (x_a,x_b) \end{aligned}$$
(18)

We split the unified query–document word representations \(\mathbf {E}_L^{(q,d)}\) into query word representations \(\mathbf {E}_L^q\) and document word representations \(\mathbf {E}_L^d\). Moreover, we define the pointwise cosine distance as Eq. (19), which only places constraints between query and document word representations, as in Fig. 5a.

$$\begin{aligned} \mathcal {C}_{\text {point}}(q,D_q)=\frac{1}{n_q}\sum _{j=1}^{n_q}s(\mathbf {E}_L^q,\mathbf {E}_L^{d_j}) \end{aligned}$$
(19)

Similarly, treating each query–document pair as an instance, we define the distance between query–document representations with different labels by this cosine distance, referred to as the pairwise cosine distance. It is computed for the transformer layer and the mutual information decomposition layer, whose query–document representations are \(\mathbf {e}_d^L = \mathbf {E}_L^{(q,d)}(0)\) and \(\mathbf {Z}_{d_r}^L\), respectively. The sum over both layers is shown in Eq. (20). It only places constraints on query–document representations, as in Fig. 5b.

$$\begin{aligned} \mathcal {C}_{\text {pair}}(q,D_q)=\sum _{\begin{array}{c} d_+\in D^+_q\\ d_-\in D^-_q \end{array}} \frac{(s(\mathbf {e}^L_{d^+},\mathbf {e}^L_{d^-}) + s(\mathbf {Z}^L_{d_r^+},\mathbf {Z}^L_{d_r^-}))}{2n_q^+n_q^-} \end{aligned}$$
(20)

Neither the pairwise nor the pointwise distance alone produces compact query, document and query–document representations. We therefore propose a triangle distance that combines the pairwise and pointwise cosine distances, as in Eq. (21). As shown in Fig. 5c, this triangle distance places constraints not only on the distance between the query and document representations but also on the distance between different documents.

$$\begin{aligned} \mathcal {L}_{\text {triangle}}(q,D_q)=\mathcal {C}_{\text {point}}(q,D_q)+\mathcal {C}_{\text {pair}}(q,D_q) \end{aligned}$$
(21)
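The triangle distance can be transcribed almost literally; in the sketch below, the pooling of word-level representations into single vectors and the label-indicator choices for Eqs. (19) and (20) are assumptions of the illustration.

```python
import torch
import torch.nn.functional as F

def s(x_a, x_b, same_label):
    """Eq. (18): s(x_a, x_b) = 1 + 2 * I(a, b) * cos(x_a, x_b)."""
    return 1.0 + 2.0 * float(same_label) * F.cosine_similarity(x_a, x_b, dim=0)

def triangle_distance(E_q, E_docs, labels, e_pair, z_dr):
    """Sketch of Eqs. (19)-(21) for a single query.

    E_q:    (d,) pooled query representation.
    E_docs: (n_q, d) pooled document representations.
    labels: (n_q,) tensor of 0/1 relevance labels.
    e_pair: (n_q, d) [CLS] query-document representations from the transformer layer.
    z_dr:   (n_q, d) query related representations from the decomposition layer.
    """
    # Eq. (19): pointwise distance between the query and each of its documents; the label
    # indicator for a (query, document) pair is taken from the document's relevance label,
    # which is an interpretation of Eq. (18).
    c_point = torch.stack([
        s(E_q, E_docs[j], bool(labels[j])) for j in range(len(labels))
    ]).mean()

    # Eq. (20): pairwise distance over relevant/non-relevant document pairs, averaged over
    # the transformer and decomposition layers; cross-label pairs use I(a, b) = 0 per Eq. (18).
    pos = (labels == 1).nonzero(as_tuple=True)[0]
    neg = (labels == 0).nonzero(as_tuple=True)[0]
    pairs = [(s(e_pair[i], e_pair[j], False) + s(z_dr[i], z_dr[j], False)) / 2.0
             for i in pos for j in neg]
    c_pair = torch.stack(pairs).mean() if pairs else torch.zeros(())

    return c_point + c_pair                   # Eq. (21): triangle distance
```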

3.4.2 Decomposition Loss

It is reasonable to decompose the document word representations into two parts satisfying the following three conditions: (1) minimizing the interdependency between query related and unrelated document word representations; (2) minimizing the interdependency between query and query unrelated document word representations; (3) maximizing the interdependency between query and query related document word representations.

Here the interdependency is measured by mutual information, which is the KL-divergence between the joint distribution and the product of the two marginal distributions. Since the marginal distributions are hard to estimate, we approximate the mutual information with the dual representation of the KL-divergence proposed by the Mutual Information Neural Estimator (MINE) [2]. Specifically, the three constraints are expressed in terms of mutual information as \(\mathcal {L}_{rn}(\mathbf {Z}_{d_r}^L,\mathbf {Z}_{d_n}^L,\phi )\) in Eq. (22), \(\mathcal {L}_{qn}(\mathbf {Z}_{q}^L,\mathbf {Z}_{d_n}^L,\phi )\) in Eq. (23) and \(\mathcal {L}_{qr}(\mathbf {Z}_{q}^L,\mathbf {Z}_{d_r}^L,\phi )\) in Eq. (24), where \(\phi\) denotes the parameters of the mapping function in Eq. (14) from \(\mathbf {Z}_d^L\) to \(\mathbf {Z}_{d_r}^L\) and \(\mathbf {Z}_{d_n}^L\). The overall mutual information constraint \(\mathcal {L}_{mi}(\mathbf {Z}_q^L,\mathbf {Z}_{d_r}^L,\mathbf {Z}_{d_n}^L,\phi )\) is computed as \(\mathcal {L}_{rn}+\mathcal {L}_{qn}-\mathcal {L}_{qr}\).

$$\begin{aligned}&\mathbb {E}_{P(\mathbf {Z}_{d_r}^L,\mathbf {Z}_{d_n}^L)}[\phi ]-\log (\mathbb {E}_{P(\mathbf {Z}_{d_r}^L)P(\mathbf {Z}_{d_n}^L)}[e^{\phi }]) \end{aligned}$$
(22)
$$\begin{aligned}&\mathbb {E}_{P(\mathbf {Z}_{q}^L,\mathbf {Z}_{d_n}^L)}[\phi ]-\log (\mathbb {E}_{P(\mathbf {Z}_{q}^L)P(\mathbf {Z}_{d_n}^L)}[e^{\phi }]) \end{aligned}$$
(23)
$$\begin{aligned}&\mathbb {E}_{P(\mathbf {Z}_{q}^L,\mathbf {Z}_{d_r}^L)}[\phi ]-\log (\mathbb {E}_{P(\mathbf {Z}_{q}^L)P(\mathbf {Z}_{d_r}^L)}[e^{\phi }]) \end{aligned}$$
(24)
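A sketch of the MINE-style estimator behind Eqs. (22)–(24); the paper ties \(\phi\) to the decomposition mapping of Eq. (14), whereas the sketch below uses a separate small statistics network, which is a simplifying assumption.

```python
import torch
import torch.nn as nn

class MINEEstimator(nn.Module):
    """Sketch of the dual-representation lower bound used in Eqs. (22)-(24):
    E_{P(a,b)}[phi] - log E_{P(a)P(b)}[exp(phi)]."""

    def __init__(self, d_k):
        super().__init__()
        # Small statistics network standing in for phi (an assumption of this sketch).
        self.phi = nn.Sequential(nn.Linear(2 * d_k, d_k), nn.ReLU(), nn.Linear(d_k, 1))

    def forward(self, a, b):
        # a, b: (batch, d_k) samples drawn jointly; shuffling b simulates the product
        # of the marginal distributions.
        joint = self.phi(torch.cat([a, b], dim=-1)).mean()
        b_shuffled = b[torch.randperm(b.size(0))]
        log_mean_exp = torch.logsumexp(self.phi(torch.cat([a, b_shuffled], dim=-1)), dim=0) \
                       - torch.log(torch.tensor(float(b.size(0))))
        return joint - log_mean_exp.squeeze()

# Overall decomposition constraint L_mi = L_rn + L_qn - L_qr (Eqs. 22-24), e.g.:
# est = MINEEstimator(d_k)
# L_mi = est(Z_dr, Z_dn) + est(Z_q, Z_dn) - est(Z_q, Z_dr)
```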

3.4.3 Ranking Loss

From the ranking perspective, we introduce a margin-based pairwise ranking loss \(\mathcal {L}_{\text {rank}}(q,D_q)\) as Eq. (25).

$$\begin{aligned} \frac{1}{n_q^+n_q^-}\sum _{\begin{array}{c} d_+\in D_q^+\\ d_-\in D_q^- \end{array}} \max (0,1-f(q,d_+)+f(q,d_-)) \end{aligned}$$
(25)

We train all tasks in a multi-task learning framework by optimizing \(\lambda (\mathcal {L}_{\text {triangle}}(q,D_q) + \mathcal {L}_{mi}) + \mathcal {L}_{\text {rank}}(q,D_q)\).
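For completeness, the ranking loss of Eq. (25) and the joint objective can be sketched as follows; the variable names are illustrative.

```python
import torch

def pairwise_ranking_loss(f_pos, f_neg):
    """Eq. (25): margin-based hinge loss over all (d+, d-) pairs of one query.

    f_pos: (n_q+,) scores f(q, d+) for relevant documents.
    f_neg: (n_q-,) scores f(q, d-) for non-relevant documents.
    """
    margins = 1.0 - f_pos.unsqueeze(1) + f_neg.unsqueeze(0)   # (n_q+, n_q-)
    return torch.clamp(margins, min=0.0).mean()

# Joint objective as stated in the text (lambda balances the auxiliary losses):
# loss = lam * (triangle_loss + mi_loss) + pairwise_ranking_loss(f_pos, f_neg)
# loss.backward()
```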

Table 1 Statistics of datasets

4 Experiments

We compare our proposed model DGRe with state-of-the-art baselines to investigate its effectiveness on two public benchmark datasets. Moreover, ablation studies for each component of DGRe are also explored.

4.1 Experimental Setting

4.1.1 Datasets

We use two TREC collections, Robust04 and WebTrack 2009–12. Robust04 uses TREC discs 4 and 5, and WebTrack 2009–12 uses ClueWeb09b as document collections. Note that the statistics are obtained only from the documents returned by BM25. Both datasets are white-space tokenized, lowercased and stemmed using the Krovetz stemmer. Consistent with the baselines on the corresponding dataset, Robust04 uses Indri for indexing, and WebTrack 2009–12 uses Anserini [34]. Table 1 provides detailed information on these two datasets.

4.1.2 Baselines

Three kinds of baselines are compared over these two datasets. (1) BM25: candidate documents for each query are usually generated by BM25 in the first-stage ranking. (2) Interaction-based neural ranking models (without BERT): DRMM [10] and ConvKNRM [

\(\mathcal {L}_{mi}\). Experimental results are shown in Table 5. The Imp.% column corresponds to the relative performance improvement of DGRe compared with DGRe without the mutual information regularization terms.

Table 5 Ablation study for mutual Information Regularization on Robust04

It is worth noting that mutual information regularization achieves a performance gain of at least 0.1% in Table 5, which indicates that the regularization is essential for retrieval. The other interesting observation is that the NDCG improvement is higher than the precision improvement. One reason is that the decomposition layer makes the document representations more discriminative, especially among the relevant documents. To verify this analysis, we randomly select a query and plot the query related representations of its relevant and irrelevant documents after dimension reduction with t-SNE [21], shown in Fig. 8.

Fig. 8 Query and document representations from DGRe with/without \(\mathcal {L}_{mi}\). The pentagram means the mass center of each group

The qualitative result suggests that the relevant document points in Fig. 8b are scattered more widely than those in Fig. 8a, while relevant and irrelevant document points are still well separated. That is, the representation distinctions among relevant documents are larger for DGRe than for DGRe without mutual information regularization. In other words, the mutual information regularization term makes the relevant document representations more discriminative, which coincides with the comparison in Table 5.

Table 6 Ranking performance comparisons on two subsets of Robust04 with different query lengths

4.6 Query Length Analysis

As mentioned before, one possible reason for the lower performance on WebTrack 2009–12 is its shorter queries. To further explore the effect of query length on the ranking performance of BERT-based ranking models, we conduct a group study on different query lengths. Robust04’s queries are divided into two groups: one with query length \(\le 3\) and the other with query length \(>3\). The numbers of queries in the two groups are 144 and 106, respectively. We randomly select 100 queries from each group and randomly divide them into training, validation and test sets with a ratio of 8:1:1. Performance comparisons on the test set with vanilla BERT and BM25 are shown in Table 6. The Imp.% column represents the relative performance improvement of each method over BM25, and the best results are in bold.

For all methods, the absolute performance on the shorter-query subset is usually lower than that on the longer-query subset. This suggests that document ranking for shorter queries is more difficult. Because the query–document pair is concatenated as input, BERT models the global word interaction over the query–document text. This helps query words find their related words, which alleviates the difficulty of short queries to some degree. Accordingly, both BERT-based ranking models obtain higher performance gains on shorter queries than on longer queries in Table 6. Owing to the word representation refinement layer and the mutual information decomposition layer, DGRe’s relative improvement is much higher than vanilla BERT’s. Compared with longer queries, shorter queries are more easily submerged when the global word interaction learned by BERT forms the query–document representation; the refinement process of DGRe makes the query part emerge in the query–document representation.

One interesting observation is that DGRe’s nDCG@20 is higher on short queries than on long queries, while its P@20 is slightly lower on short queries than on long queries. This inconsistency is partly owing to the small difference in average query length between the two groups. The major reason lies in DGRe’s bias toward both short queries and relevant documents: when P@20 is more or less the same, nDCG@20 will be higher on short queries. In general, DGRe’s absolute performance on long queries is higher than on short queries, but its improvement over the baselines on long queries is smaller than on short queries.

5 Conclusion

To reduce the effects of spurious information, we propose DGRe, which removes useless word relations in BERT and disentangles the query related part of the document representation for the document ranking task. To alleviate the observable confounder in word pair relations, we make the back-door adjustment on the causal graph and refine the word representations over the disentangled graph generated by our adaptive masking method. To resolve the unobservable confounder in document word representations, we make the front-door adjustment on the causal graph and decompose the document word representations into query related and unrelated parts while minimizing the mutual information between them. For optimization, we introduce a triangle distance loss to constrain the transformer and refinement layers, and a mutual information regularization to penalize the decomposition layer.

Experiments are comprehensively conducted on two public benchmark datasets, and we obtain the following results. (1) By reducing the effects of spurious information, DGRe outperforms state-of-the-art methods by about 2% in terms of P@20 and nDCG@20. (2) Both the masking strategies and the mutual information decomposition layer play essential roles in the performance improvement. (3) DGRe mainly improves performance on short queries.

In real-world applications, the two-stage ranking paradigm, i.e. retrieval and re-ranking, is common in modern information retrieval systems. Our proposed method DGRe is mainly employed in the re-ranking stage to sort the retrieved documents according to their relevance scores to the query.

Two major limitations of DGRe remain to be addressed. First, due to its low computational efficiency, DGRe cannot be directly applied to the retrieval stage; we will try to improve the model efficiency and apply it to the dense retrieval scenario. Second, a simple ReLU function is used for adaptive masking to remove useless word relations, so the decision threshold for useless word relations is fixed across scenarios; in future work, we will use optimal transport techniques to improve the masking strategy in the transformer layer.