1 Introduction

The setting of dueling bandits (Yue and Joachims 2009; Sui et al. 2018; Bengs et al. 2021) is a variant of the standard multi-armed bandit (MAB) problem, in which the learner is allowed to compare pairs of choice alternatives (arms) in a sequential manner. Thus, instead of repeatedly pulling an arm and observing a numerical reward, the learner pulls two arms and observes the winner of the corresponding duel. Like in the standard MAB problem, this feedback is assumed to be stochastic. A typical task of the learner is to find the “best” arm as quickly as possible, or, more generally, to identify a complete ranking of all arms. There is a variety of practically relevant applications for this learning scenario, such as ranking XBox gamers according to duel outcomes (Guo et al. 2012) or rating different objects based on pairwise preferences of users, which can nowadays be gathered quite conveniently by means of crowdsourcing services such as Amazon Mechanical Turk (Chen et al. 2013).

Relaxed assumptions of transitivity, especially different types of stochastic transitivity (Fishburn 1973), play an important role in this regard: If arm a is likely to be preferred over arm b, and b is likely to be preferred over arm c, then a is also likely to be preferred over c. Assumptions of that kind are important for several reasons. First, they ensure that the learning task itself is actually well defined, for example that a naturally “best” arm actually exists. Second, they form the basis for the design of efficient learning algorithms, which exploit generalized transitivity to reduce sample complexity (Yue and Joachims 2011; Mohajer et al. 2017; Falahatgar et al. 2018). This is comparable to how standard sorting algorithms avoid the comparison of all pairs of items and achieve an \(\mathcal {O}(n \log n)\) (instead of an \(\mathcal {O}(n^2)\)) complexity.

Somewhat surprisingly, the problem of testing the validity of transitivity assumptions underlying various algorithms has not been considered so far. Needless to say, this would be important to guarantee the meaningfulness of the results produced by such algorithms. In fact, if the assumptions made by an algorithm are violated by the data-generating process in a concrete application, then neither its prediction nor any of its guarantees can be trusted anymore. In this paper, we therefore propose a method for testing an important form of transitivity, namely weak stochastic transitivity (WST), in an online manner. Being the weakest type of stochastic transitivity, WST is quite natural to start with. Moreover, weak stochastic transitivity of pairwise preferences (winning probabilities) is a necessary and sufficient condition for the existence of a complete ranking (strict total ordering) of all arms that is consistent with all pairwise preferences.

More specifically, we introduce an algorithmic framework consisting of two main components, namely an active sampling strategy \(\pi\) and a sequential test. In this way, the algorithmic framework covers two conceivable scenarios for online hypothesis testing:

  • The passive online testing scenario, where the sampling strategy \(\pi\) is any dueling bandits algorithm based on a transitivity assumption, and the test component is (passively) monitoring the statistical validity of the transitivity assumption made by the dueling bandits algorithm — in other words, the learning and testing components are working in parallel, independently of each other.

  • The active online testing scenario, in which the sampling strategy \(\pi\) is specifically constructed to support the test component, i.e., to make a test decision as quickly as possible.

In this paper, we introduce the problem of testing different types of stochastic transitivity within a dueling bandits problem (Sect. 4), both with and without the so-called low noise assumption (Korba et al. 2017). We prove that the expected sample complexity for testing different types of stochastic transitivity stronger than WST in an online manner is infinite in the worst case. These results provide an additional theoretical motivation for focusing on WST, as it is the only type of stochastic transitivity that admits finite expected sample complexity for online testing, which can be inferred by an appropriate reduction to the setting of pure exploration bandits with multiple correct answers introduced by Degenne and Koolen (2019) (Sect. 5).

We improve upon the corresponding asymptotic lower bounds on the expected sample complexity for testing WST from the latter reduction by providing instance-wise lower bounds for fixed confidence levels (Sect. 6). For the passive online testing scenario, we construct a test component based on multiple binomial hypothesis tests, for which we show consistency in terms of almost sure termination time and reliability in terms of maintained error bounds under mild assumptions on \(\pi\). For the active online testing variant, we provide a sampling strategy \(\pi ,\) such that the expected sample complexity of the latter test is optimal up to a logarithmic term (Sect. 7). Moreover, by exploiting a connection between WST and graph theory, we suggest an enhancement to further improve the efficiency of the testing algorithm (Sect. 8). The superiority of this variant in the passive setting is illustrated by an empirical evaluation (Sect. 9). The paper starts with a brief account of related work (Sect. 2), followed by a refresher of the dueling bandits problem as well as different types of stochastic transitivity (Sect. 3). Detailed proofs of theoretical results are provided in the supplementary material.

2 Related work

The dueling bandits problem was studied under strong stochastic transitivity in (Yue et al. 2012) and relaxed stochastic transitivity in (Yue and Joachims 2011), in both cases with the goal of regret minimization. In these works, the transitivity assumption is explicitly required for the theoretical guarantees. In other approaches, transitivity properties are assumed in a more indirect way, for example through probabilistic models of the feedback process. This includes the Plackett-Luce model (Luce 1959; Plackett 1975) resp. Bradley-Terry model (Bradley and Terry 1952) considered in (Szörényi et al. 2015) resp. (Maystre and Grossglauser 2017), as well as the Mallows model (Mallows 1957) studied in (Busa-Fekete et al. 2014). Mohajer et al. (2017) consider the goals of finding the best arm as well as the (top-k-)ranking of arms under WST, while Falahatgar et al. (2017a, 2017b, 2018) investigate the impact of various transitivity assumptions on these goals in an online PAC-framework. Finally, transitivity assumptions were also analyzed in batch learning scenarios, for example to estimate the underlying pairwise preference relation (Shah et al. 2016), or for the purpose of rank aggregation (Korba et al. 2017).

The literature on testing transitivity conditions is primarily rooted in the social sciences, psychology, and economics, with a special focus on experimental studies for real data. The only mathematical treatment we found is (Iverson and Falmagne 1985), where the authors provide an asymptotic likelihood-ratio test for WST. The use of Bayes factors for testing stochastic transitivity is proposed in (Cavagnaro and Davis-Stober 2014). In (McNamara and Diwadkar 1997) and (Waite 2001), multiple binomial tests are conducted to test WST of preferences in different field studies. From a methodological point of view, this is closest to the sequential testing approach put forward in this paper. Yet, all these works are situated in the setting of classical hypothesis testing, assuming all the data to be available beforehand. In contrast to this, the focus of this paper is on hypothesis testing in an online setting, where data arrives sequentially, and test decisions should be taken as quickly as possible while maintaining a predefined level of confidence.

As already mentioned in the introduction, the problem of testing stochastic transitivity in an online manner can be tackled by a suitable reduction to the setting of pure exploration bandits with multiple correct answers introduced by Degenne and Koolen (2019), which will be discussed more thoroughly in Sect. 5.

3 Theoretical background

In this section, we concisely recall the main theoretical foundations needed throughout the paper. In the supplementary material, we provide a list of symbols used in the paper for the sake of convenience.

3.1 Dueling bandits

Consider a finite set of m arms identified by the index set \([m] :=\{1,\dots ,m\}\). In the setting of the dueling bandits problem, two distinct arms \(i,j\in [m]\) can be compared with each other at each time step \(t\in {\mathbb {N}}\). Querying a pairwise preference, the learner is provided with binary feedback about the winner of the duel, which is assumed to be generated by a time-stationary iid probabilistic process. The probability \({\mathbb {P}}(i \succ j)\) that arm i wins against arm j is given by some underlying (unknown) ground truth parameter \(q_{i,j} \in [0,1].\) We suppose that ties are not possible. Thus (assuming w.l.o.g. \(q_{i,i} = \frac{1}{2}\) for every \(i\in [m]\)), we can infer that \({\mathbf {Q}} = (q_{i,j})_{1\le i,j\le m}\) is a reciprocal relation on [m], i.e., \({\mathbf {Q}}\) is an element of

$$\begin{aligned} \mathcal {Q}_{m} :=\Big \{ {\varvec{ Q }}=(q_{i,j})_{1\le i,j\le m} \in [0,1]^{m\times m} \, \vert \, q_{j,i} = 1-q_{i,j} \text { for every } i,j\in [m] \Big \}. \end{aligned}$$

To assimilate the information available at time \(t\in {\mathbb {N}}\), let us write \(({\varvec{ n }}_{t})_{i,j}\) for the number of comparisons between i and j until time t, and \(({\varvec{ w }}_{t})_{i,j}\) for the number of times i has won against j until time t. This obviously implies \(({\varvec{ w }}_{t})_{i,j}+({\varvec{ w }}_{t})_{j,i}=({\varvec{ n }}_{t})_{i,j}\) and \(({\varvec{ n }}_{t})_{i,j}=({\varvec{ n }}_{t})_{j,i}\). Then, \({\varvec{ n }}_{t} = (({\varvec{ n }}_{t})_{i,j})_{1\le i,j\le m}\) is a symmetric integer-valued matrix with zeros on its diagonal. If \({\varvec{ w }}\in {\mathbb {N}}_{0}^{m\times m}\) and \({\varvec{ n }}\in {\mathbb {N}}^{m\times m}_{0}\), we denote the matrix \((\frac{w_{i,j}}{n_{i,j}})_{1\le i,j\le m} \in [0,1]^{m\times m}\) by \(\frac{{\varvec{ w }}}{{\varvec{ n }}}\), where we define for convenience \(\frac{x}{0} :=\frac{1}{2}\) for any \(x\in {\mathbb {N}}_{0}\). Moreover, we write \([m]_{2}\) for the set containing all subsets of size 2 of [m] and \((m)_{2}\) for the set of all \((i,j) \in [m]\times [m]\) with \(i<j\). A specific learning algorithm in the realm of dueling bandits can be identified by a sampling strategy as defined in the following.

Definition 3.1

A sampling strategy \(\pi\) is a family of random mappings, which, depending on the time t and the observations \({\varvec{ n }}_{0},{\varvec{ w }}_{0},\dots , {\varvec{ n }}_{t-1},{\varvec{ w }}_{t-1}\) available before time t, determines the two distinct arms \(i(t),j(t) \in [m]\) that are to be compared at time \(t\in {\mathbb {N}}\). Let \(\varPi\) denote the set of all sampling strategies, while \(\varPi _{\infty }\) denotes the family of sampling strategies \(\pi\) that sample every pair \(\{i,j\}\) almost surely (a.s.) infinitely often, which means that \(({\varvec{ n }}_{t})_{i,j} \,\rightarrow \,\infty\) a.s. as \(t\,\rightarrow \,\infty\).

Note that if \(\pi \in \varPi \setminus \varPi _{\infty }\), then a sampling strategy \({\hat{\pi }} \in \varPi\) that chooses the same pair as \(\pi\) in each time step with probability \(1-1/t\), and otherwise (i.e., with probability 1/t) picks a pair \(\{i,j\}\) uniformly at random from \([m]_{2},\) fulfills \({\hat{\pi }} \in \varPi _{\infty }\) and

$$\begin{aligned} {\mathbb {P}}\big (\pi (t,({\varvec{ n }}_{t'},{\varvec{ w }}_{t'} )_{0\le t'\le t-1}) \not = {\hat{\pi }}(t,({\varvec{ n }}_{t'},{\varvec{ w }}_{t'})_{0\le t'\le t-1})\big )\le \frac{1}{t} \,\rightarrow \,0 \text { as } t\,\rightarrow \,\infty \, . \end{aligned}$$

Thus, \({\hat{\pi }}\) and \(\pi\) behave similarly in the limit. This shows that the assumption \(\pi \in \varPi _{\infty }\), which is required for theoretical results in our framework, is rather mild.
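To make this construction concrete, the following Python sketch (our own illustration, not part of the formal framework; the strategy interface pi(t, history) -> (i, j) is an assumption) wraps an arbitrary sampling strategy in the way just described:

```python
import random

def make_pi_inf(pi, m):
    """Wrap an arbitrary sampling strategy so that every pair is compared
    infinitely often almost surely (i.e., the wrapper lies in Pi_infinity).

    pi : callable (t, history) -> (i, j), the original strategy (assumed interface).
    m  : number of arms.
    """
    def pi_hat(t, history):
        # With probability 1/t, ignore pi and draw a pair uniformly from [m]_2;
        # otherwise follow pi. The deviation probability 1/t vanishes as t grows.
        if random.random() < 1.0 / max(t, 1):
            i, j = random.sample(range(m), 2)
            return min(i, j), max(i, j)
        return pi(t, history)
    return pi_hat
```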

3.2 Stochastic transitivity

Different types of stochastic transitivity have been used in the realm of dueling bandits problems (Bengs et al. 2021), mainly because they provide a certain degree of regularity of the reciprocal relations in \(\mathcal {Q}_{m}\), and thereby facilitate learning. In particular, the following transitivities are commonly considered in the literature.

Definition 3.2

A reciprocal relation \({\varvec{ Q }}=(q_{i,j})_{1\le i,j\le m} \in \mathcal {Q}_{m}\) is said to satisfy

  1. (i)

    weak stochastic transitivity \(\mathrm {(WST)}\) iff

    $$\begin{aligned} \big ( q_{i,j}\ge 1/2 \wedge q_{j,k}\ge 1/2\big ) \Rightarrow q_{i,k}\ge 1/2 \, , \end{aligned}$$
  2. (ii)

    moderate stochastic transitivity \(\mathrm {(MST)}\) iff

    $$\begin{aligned} \big ( q_{i,j}\ge 1/2 \wedge q_{j,k}\ge 1/2\big ) \Rightarrow q_{i,k}\ge \min (q_{i,j},q_{j,k}) \, , \end{aligned}$$
  3. (iii)

    \(\nu\)-relaxed stochastic transitivity \({(\nu -RST)}\) for some \(\nu \in (0,1)\) iff

    $$\begin{aligned} \big ( q_{i,j}\ge 1/2 \wedge q_{j,k}\ge 1/2\big ) \Rightarrow q_{i,k}\ge \nu \max (q_{i,j},q_{j,k}) +(1-\nu )/2 \, , \end{aligned}$$
  4. (iv)

    strong stochastic transitivity \(\mathrm {(SST)}\) iff

    $$\begin{aligned} \big ( q_{i,j}\ge 1/2 \wedge q_{j,k}\ge 1/2\big ) \Rightarrow q_{i,k}\ge \max (q_{i,j},q_{j,k}) \, , \end{aligned}$$

where all previous conditions must hold for all distinct \(i,j,k\in [m]\).

The set consisting of all reciprocal relations that are stochastically transitive of a certain type is

$$\begin{aligned} \mathcal {Q}_{m}(\mathrm {XST}) :=\{{\mathbf {Q}}\in \mathcal {Q}_{m} \, | \, {\mathbf {Q}} \text { is XST}\}, \quad \mathrm {XST}\in \{{\mathrm {WST}},\mathrm {MST}, {\nu -RST},\mathrm {SST}\} , \end{aligned}$$

and we write \(\mathcal {Q}_{m}(\lnot \mathrm {XST}) :=\mathcal {Q}_{m} \setminus \mathcal {Q}_{m}(\mathrm {XST})\). The following relationships hold between the different types of stochastic transitivities:

$$\begin{aligned} \mathcal {Q}_{m}(\mathrm {SST}) \subsetneq \mathcal {Q}_{m}(\mathrm {MST}) \subsetneq \mathcal {Q}_{m}({\mathrm {WST}}), \ \ \mathcal {Q}_{m}(\mathrm {SST}) \subsetneq \mathcal {Q}_{m}({\nu -RST}) \subsetneq \mathcal {Q}_{m}({\mathrm {WST}}) \, , \end{aligned}$$

but neither \(\mathcal {Q}_{m}({\nu -RST}) \subseteq \mathcal {Q}_{m}(\mathrm {MST})\) nor \(\mathcal {Q}_{m}(\mathrm {MST}) \subseteq \mathcal {Q}_{m}({\nu -RST})\).
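For concreteness, the four conditions of Definition 3.2 can be checked mechanically for a given reciprocal relation; the following Python sketch (our own illustration; the function name and interface are not from the paper) does exactly that:

```python
import numpy as np

def is_transitive(Q, kind="WST", nu=0.5):
    """Check a stochastic transitivity condition from Definition 3.2.

    Q    : (m, m) array with Q[i, j] = q_ij and Q[j, i] = 1 - q_ij.
    kind : "WST", "MST", "RST" (nu-relaxed, using the parameter nu), or "SST".
    """
    m = Q.shape[0]
    for i in range(m):
        for j in range(m):
            for k in range(m):
                if len({i, j, k}) < 3:
                    continue
                if Q[i, j] >= 0.5 and Q[j, k] >= 0.5:
                    if kind == "WST":
                        bound = 0.5
                    elif kind == "MST":
                        bound = min(Q[i, j], Q[j, k])
                    elif kind == "RST":
                        bound = nu * max(Q[i, j], Q[j, k]) + (1 - nu) / 2
                    elif kind == "SST":
                        bound = max(Q[i, j], Q[j, k])
                    else:
                        raise ValueError(f"unknown transitivity type: {kind}")
                    if Q[i, k] < bound:
                        return False
    return True

# A cyclic 3-arm instance (1 beats 2, 2 beats 3, 3 beats 1) violates even WST:
Q_cyclic = np.array([[0.5, 0.6, 0.3],
                     [0.4, 0.5, 0.7],
                     [0.7, 0.3, 0.5]])
assert not is_transitive(Q_cyclic, "WST")
```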

3.3 Violations of WST

To illustrate the issues that may arise in case of a violation of the WST assumption, and highlight the importance of testing such assumptions, consider algorithms that are based on the idea of (noisy) sorting (Szörényi et al. 2015; Mohajer et al. 2017). Roughly speaking, the active sampling strategies underlying such algorithms mimic the behavior of sorting algorithms, such as merge sort or quicksort — with the main difference that, due to the assumed stochasticity, deciding the order between two arms may require repeated comparisons.

Obviously, weak stochastic transitivity is the minimal assumption required by such algorithms. On the other hand, it is easy to see that a sorting-based algorithm will always return a complete ranking (with high confidence), regardless of whether the underlying relation contains preferential cycles or not. Yet, this ranking will strongly depend on the order in which the arms are compared, and hence be more or less random and therefore meaningless.

4 Online transitivity testing

We focus on the following testing problem in the context of an underlying dueling bandits problem:

$$\begin{aligned} {\mathbf {H}}_{0}: {\mathbf {Q}} \text { satisfies } \mathrm {XST} \quad {\mathbf {H}}_{1}: {\mathbf {Q}} \text { does not satisfy } \mathrm {XST}, \end{aligned}$$
(1)

where \(\mathrm {XST}\in \{{\mathrm {WST}},\mathrm {MST}, {\nu -RST},\mathrm {SST}\}.\) This test shall be conducted for different types of transitivity in an online manner.

Thus, it is natural to consider sequential hypothesis tests, in which a test decision can be provided at any time during the data-generating process. The particular choice of the null hypothesis is motivated by the passive scenario, in which a learning algorithm assumes \(\mathrm {XST}\) to be fulfilled and the test shall detect a possible violation thereof. As we focus on tests with guarantees on both the type I and the type II error, it is possible to swap \({\mathbf {H}}_{0}\) and \({\mathbf {H}}_{1}\), and still obtain qualitatively the same theoretical results as below.

In the course of the paper, we focus on algorithms \(\mathcal {A}\) for the testing problem, which might be probabilistic and interact with the underlying dueling bandits environment, as stipulated by the definition of a sampling strategy \(\pi\) (Definition 3.1). In case an algorithm \(\mathcal {A}\) terminates, it returns a decision denoted by \({\mathbf {D}}(\mathcal {A}) \in \{\mathrm {XST}, \lnot \mathrm {XST} \}\) with the semantics that \({\mathbf {D}}(\mathcal {A})=\mathrm {XST}\) resp. \({\mathbf {D}}(\mathcal {A})=\lnot \mathrm {XST}\) indicates that \(\mathcal {A}\) predicts that \(\mathrm {XST}\) holds resp. is violated. Moreover, we denote by \(T^{\mathcal {A}}\) the sample complexity of an algorithm \(\mathcal {A}\), i.e., the number of pairwise comparisons \(\mathcal {A}\) has made before termination.

For our theoretical analysis of the testing problem, we will consider the following set of relations:

$$\begin{aligned} \mathcal {Q}_{m}^{h} :=\big \{ {\mathbf {Q}}=(q_{i,j})_{1\le i,j\le m} \in \mathcal {Q}_{m} \, | \, | q_{i,j}-1/2 | > h \text { for all distinct } i,j\in [m] \big \} \, , \end{aligned}$$

where \(h\in [0,1/2)\). In case \(h>0\), the relations in \(\mathcal {Q}_{m}^{h}\) are said to satisfy the low noise assumption (Korba et al. 2017). Here, the parameter h determines to some extent the complexity of the testing problem: For instance, the larger h, the easier it becomes to determine the sign of \(q_{i,j}-1/2,\) which in turn facilitates checking \({\mathrm {WST}}\). For \(\mathrm {XST} \in \{{\mathrm {WST}},\mathrm {MST},\mathrm {SST}, {\nu -RST}\}\) and any \(h\in [0,1/2)\), we define

$$\begin{aligned} \mathcal {Q}_{m}^{h}(\mathrm {XST}) :=\mathcal {Q}_{m}^{h} \cap \mathcal {Q}_{m}(\mathrm {XST}) \quad \text { and } \quad \mathcal {Q}_{m}^{h}(\lnot \mathrm {XST}) :=\mathcal {Q}_{m}^{h} \cap \mathcal {Q}_{m}(\lnot \mathrm {XST}) \, . \end{aligned}$$

Moreover, we may regard \(\mathcal {Q}_{m}\) as a subset of \({\mathbb {R}}^{m(m-1)/2}\) and, in this way, equip it with the standard Euclidean topology of \({\mathbb {R}}^{m(m-1)/2 }\). Therefore, for a subset \(\mathcal {Q}_{m}' \subseteq \mathcal {Q}_{m}\), we use the standard notation \(\partial \mathcal {Q}_{m}'\) for the boundary of \(\mathcal {Q}_{m}'\) as a subset of this topological space \(\mathcal {Q}_{m}\). The notion of a solution to the \(\mathrm {XST}\)-testing problem is stated in the following.

Definition 4.1

For given \(h\in [0,1/2)\) and error probabilities \(\alpha ,\beta \in (0,1),\) we say that an algorithm \(\mathcal {A}\) solves the \(\mathrm {XST}\)-testing problem on \(\mathcal {Q}_{m}^{h}\) for \(\alpha\) and \(\beta\) (in short: \(\mathcal {A}\) solves \({\mathcal {P}_{\mathrm { XST}}^{m,h,\alpha ,\beta }}\)) if \(T^{\mathcal {A}}\) is almost surely finite on any instance \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\) and the following holds:

$$\begin{aligned} \begin{aligned} \inf \nolimits _{{\varvec{ Q }}\in \mathcal {Q}_{m}^{h}(\mathrm {XST})} {\mathbb {P}}_{{\varvec{ Q }}}({\mathbf {D}}(\mathcal {A}) = \mathrm {XST})&\ge 1-\alpha \\ \text { and } \inf \nolimits _{{\varvec{ Q }}\in \mathcal {Q}_{m}^{h}(\lnot \mathrm {XST})} {\mathbb {P}}_{{\varvec{ Q }}}({\mathbf {D}}(\mathcal {A}) = \lnot \, \mathrm {XST})&\ge 1-\beta . \end{aligned} \end{aligned}$$
(2)

Interestingly, as the following theorem reveals, the testing problem (1) for a type of stochastic transitivity stronger than \({\mathrm {WST}}\) turns out to be too difficult. Hence, we will focus on the case \(\mathrm {XST}={\mathrm {WST}}\) in the rest of the paper.

Theorem 4.2

Let \(h,\alpha , \beta \in (0,1/2)\), \(m\in {\mathbb {N}}_{\ge 3}\) and \(\mathrm {XST} \in \{\mathrm {MST},\mathrm {SST}, {\nu -RST}\}\) be fixed. If an algorithm \(\mathcal {A}\) solves \({\mathcal {P}_{\mathrm { XST}}^{m,h,\alpha ,\beta }}\), then \({\mathbb {E}}_{{\varvec{ Q }}}[T^{\mathcal {A}}] = \infty\) for any \({\mathbf {Q}} \in \mathcal {Q}_{m}^{h}(\mathrm {XST}) \cap \partial \mathcal {Q}_{m}^{h}(\lnot \mathrm {XST})\not = \emptyset\). In particular, we have \(\sup \nolimits _{{\varvec{ Q }} \in \mathcal {Q}_{m}^{h}} {\mathbb {E}}_{{\varvec{ Q }}}[T^{\mathcal {A}}] = \infty .\)

To prove this theorem, we show that any solution \(\mathcal {A}\) to \({\mathcal {P}_{\mathrm { XST}}^{m,h,\alpha ,\beta }}\) may be used to test, for some \(p_{0}\in [0,1]\) and any \(p_{1}>p_{0}\), with an error probability of at most \(\max \{\alpha ,\beta \}\), whether a coin \(C\sim {\mathrm {Ber}( p )}\) has bias \(p=p_{0}\) or \(p=p_{1}\). But if \(p_1\) converges to \(p_0\), the number of coin flips necessary to maintain the error probability tends to infinity in expectation. A detailed proof of the theorem is provided in Section B in the supplement.

5 Reduction to Pure Exploration Bandits with Multiple Correct Answers

The testing problem at hand may be reduced to the Pure Exploration Bandits scenario with multiple correct answers as presented by Degenne and Koolen (2019), the details of which can be found in Section F of the supplement. This approach leads to the following results: If \(\mathcal {A}(\gamma )\) solves \({\mathcal {P}_{\mathrm { WST}}^{m,h,\gamma ,\gamma }}\), then for some (known) constant \(D_{m}^{h}({\mathbf {Q}})>0\),

$$\begin{aligned} \liminf _{\gamma \,\rightarrow \,0} \frac{{\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}(\gamma )}]}{\ln (\gamma ^{-1})} \ge \frac{1}{D_{m}^{h}({\mathbf {Q}})} \, , \end{aligned}$$
(3)

and there exists a solution \(\mathcal {A}(\gamma )\) to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\gamma ,\gamma }}\) with

$$\begin{aligned} \lim _{\gamma \,\rightarrow \,0} \frac{{\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}(\gamma )}]}{\ln (\gamma ^{-1})} \le \frac{1}{D_{m}^{h}({\mathbf {Q}})} . \end{aligned}$$
(4)

If \({\mathbf {Q}} \in \mathcal {Q}_{m}(\mathrm {X})\) (for \(\mathrm {X} \in \{{\mathrm {WST}}, \lnot {\mathrm {WST}}\}\)), the complexity term \(D_{m}^{h}({\mathbf {Q}})\) is given as

$$\begin{aligned} \sup \nolimits _{{\mathbf {v}} \in \varDelta _{(m)_{2}}} \inf \nolimits _{\mathbf {Q'} \in \mathcal {Q}_{m}^{h}(\lnot \mathrm {X})} \sum \nolimits _{(i,j) \in (m)_{2}} v_{i,j} d_{\mathrm {KL}}(q_{i,j},q'_{i,j}) \, , \end{aligned}$$

where \(\varDelta _{(m)_{2}}\) is the set of all \({\mathbf {v}} = (v_{i,j})_{1\le i<j\le m}\) with \(\min _{i<j} v_{i,j} \ge 0\) and \(\sum _{i<j} v_{i,j} = 1\), and \(d_{\mathrm {KL}}(p,q) = p\ln (p/q)+(1-p)\ln ((1-p)/(1-q))\) is the KL-divergence between two Bernoulli distributions with success probability p resp. q. We prove in the supplement (cf. Lemma F.7) that

$$\begin{aligned} \frac{1/4-h^{2}}{192} \left( {\begin{array}{c}m\\ 2\end{array}}\right) h^{-2} \le \sup \nolimits _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}\mathrm {(WST)}} \frac{1}{D_{m}^{h}({\mathbf {Q}})} \le \sup \nolimits _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}} \frac{1}{D_{m}^{h}({\mathbf {Q}})} \le \frac{1}{8} \left( {\begin{array}{c}m\\ 2\end{array}}\right) h^{-2} \end{aligned}$$

and

$$\begin{aligned} \frac{1}{192} \left( {\begin{array}{c}m\\ 2\end{array}}\right) h^{-2} \le \sup \nolimits _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}\mathrm {(WST)}} \frac{1}{D_{m}^{0}({\mathbf {Q}})} \le \sup \nolimits _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}} \frac{1}{D_{m}^{0}({\mathbf {Q}})} \le \frac{1}{2} \left( {\begin{array}{c}m\\ 2\end{array}}\right) h^{-2} \end{aligned}$$

hold for all \(h\in (0,1/2)\). This indicates that the case \(h=0\) is more complex than the case \(h>0\) and shows that any optimal solution \(\mathcal {A}(\gamma )\) to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\gamma ,\gamma }}\) or \({\mathcal {P}_{\mathrm { WST}}^{m,0,\gamma ,\gamma }}\) fulfills

$$\begin{aligned} \sup \nolimits _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}} \lim \nolimits _{\gamma \,\rightarrow \,0} \frac{{\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}(\gamma )}]}{\ln (\gamma ^{-1})} \in \varTheta (m^{2}h^{-2}), \end{aligned}$$
(5)

respectively, as \(\max \{m,h^{-1}\} \,\rightarrow \,\infty\). Unfortunately, these results do not yield any information on the case where \(\gamma\) is fixed. Moreover, the algorithmic solution \(\mathcal {A}(\gamma )\) presented by Degenne and Koolen (2019) is very inefficient for the problem of testing \({\mathrm {WST}}\), if not infeasible in practice, which is due to a hard min-max problem that has to be solved at each time step (cf. Remark F.1). In the following, we will discuss further lower and upper bounds on the worst-case sample complexity of solutions to \(\mathcal {P}_{{\mathrm {WST}}}^{m,h,\alpha ,\beta }\). Our results are to some extent stronger than (3) and (4), as they cover the case of a fixed confidence level \(\gamma ,\) which in turn corresponds to the typical setting of (online) testing.

6 Lower bounds for online testing of weak stochastic transitivity

In this section, we provide lower bounds on the expected termination time of any algorithm solving \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). Similarly to Theorem 4.2, these results are obtained by reducing a testing problem for the biases of independent coins to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). A sample complexity analysis of the latter testing problem results in the bounds stated below, the proof of which can again be found in Section B.

In order to state an instance-wise lower bound for the case \(h>0\), let us introduce some more notation: Given \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\), we write \(\sigma _{{\mathbf {Q}}}\) for a permutation on [m], which fulfills \(q_{\sigma _{{\mathbf {Q}}}(i),\sigma _{{\mathbf {Q}}}(i+1)} > 1/2\) for every \(i\in [m-1]\). We show in the appendix (Lemma B.1) that \(\sigma _{{\mathbf {Q}}}\) exists for every \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\), even though we only need this for every \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}({\mathrm {WST}})\). In case \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}({\mathrm {WST}})\), \(\sigma _{{\mathbf {Q}}}\) is the underlying ground-truth ranking of \({\mathbf {Q}}\), and permuting rows and columns according to \(\sigma _{{\mathbf {Q}}}\) results in a reciprocal relation with entries \(>1/2\) above the diagonal.

Theorem 6.1

Let \(h_{0},\gamma _{0} \in (0,1/2)\) be fixed, \(h\in (0,h_{0})\), \(\alpha , \beta \in (0,\gamma _{0})\) and \(m\in {\mathbb {N}}_{\ge 3}.\) Suppose \(\mathcal {A}\) is an algorithm that solves \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\), and let \({\mathbf {Q}} \in \mathcal {Q}_{m}^{h}({\mathrm {WST}})\) be arbitrary. Define \(h_{i,j} :=|q_{i,j}-1/2|\) for every distinct \(i,j\in [m]\), \(\gamma :=\min \{\alpha ,\beta \}\), and \(\sigma = \sigma _{{\mathbf {Q}}}.\) Then, there exists a constant \(c=c(h_{0},\gamma _{0})>0\) such that

$$\begin{aligned} {\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}}] \ge c \ln \left( \gamma ^{-1} \right) \sum \nolimits _{1\le i<j-1<m} h_{\sigma (i),\sigma (j)}^{-2} \ge c \left( {\begin{array}{c}m-1\\ 2\end{array}}\right) \ln \left( \gamma ^{-1}\right) h^{-2}. \end{aligned}$$
(6)

Thus, \(\sup _{{\varvec{ Q }} \in \mathcal {Q}_{m}^{h}} {\mathbb {E}}_{{\varvec{ Q }}}[T^{\mathcal {A}}]\) is in \(\varOmega (m^{2}h^{-2} \ln \gamma ^{-1})\) as \(\max \{m,\gamma ^{-1},h^{-1}\} \,\rightarrow \,\infty\).

Note that the right-hand side of (6) is of the order \(m^{2}h^{-2}\ln (\gamma ^{-1})\), which is coherent with (5). The fact that the instance-wise bound only depends on \(\left( {\begin{array}{c}m-1\\ 2\end{array}}\right)\) instead of all \(\left( {\begin{array}{c}m\\ 2\end{array}}\right)\) entries of \({\mathbf {Q}}\) is due to our proof technique, which is nonetheless of the same order with respect to m.

Let us now consider the more complex case \(h=0\). As any solution to \({\mathcal {P}_{\mathrm { WST}}^{m,0,\alpha ,\beta }}\) is also a solution to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\) for any \(h\in (0,1/2)\), Theorem 6.1 is applicable in this case. However, we can slightly improve upon this. In the following, for functions \(f,g: X \,\rightarrow \,(0,\infty )\), we say that \(f\in \varOmega _{\sup }(g)\) as \(x\,\rightarrow \,x_{0}\) if \(\limsup _{x\,\rightarrow \,x_{0}} \frac{g(x)}{f(x)} <\infty\).

Theorem 6.2

Let \(\alpha , \beta \in (0,1/2)\) be fixed and suppose \(\mathcal {A}\) to be an algorithm that solves \({\mathcal {P}_{\mathrm { WST}}^{m,0,\alpha ,\beta }}\). Then, the following holds:

  1. (a)

    \({\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}}] = \infty\) for any \({\mathbf {Q}}\) in a set \(\emptyset \not = {\mathcal{Q}}_{m}^{\dagger } \subsetneq \partial {\mathcal{Q}}_{m}({\mathrm{WST}}) \cap \partial {\mathcal{Q}}_{m}(\lnot {\mathrm{WST}})\),

  2. (b)

    \(\sup _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}}{\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}}] \in \varOmega (m^{2} h^{-2}) \cap \varOmega _{\sup }(h^{-2} \ln \ln h^{-1})\) as \(\max \{m,h^{-1}\} \,\rightarrow \,\infty\).

As we point out in the proof of this theorem, the set \(\mathcal {Q}_{m}^{\dagger }\) in (a) can be chosen as the set of all \({\mathbf {Q}} \in \mathcal {Q}_{m}\), for which some permutation \(\sigma\) on [m] exists such that the following conditions are fulfilled:

$$\begin{aligned} \forall 1\le i<j\le m&: \, q_{\sigma (i),\sigma (j)} \ge 1/2, \\ \forall i\in [m-1]&: \, q_{\sigma (i),\sigma (i+1)}>1/2, \\ \exists 1\le i'<j'-1\le m-1&: \, q_{\sigma (i'),\sigma (j')} = 1/2. \end{aligned}$$

In the proof of the theorem, to make (b) more explicit, we provide several examples for a family \(\{{\mathbf {Q}}(h)\}_{h\in (0,1/2)} \subseteq \mathcal {Q}_{m}^{h}({\mathrm {WST}})\), for which

$$\begin{aligned} \limsup _{h\searrow 0} {\mathbb {E}}_{{\mathbf {Q}}(h)}[T^{\mathcal {A}}]/(h^{-2} \ln \ln (h^{-1})) \ge (1-2\gamma )/2. \end{aligned}$$

Due to the occurrence of the limit superior in Lemma A.2, this is the best we may infer from that lemma.

At first sight, part (b) of Theorem 6.2 may appear to contradict (5), which does not involve a \(\ln \ln h^{-1}\)-factor. However, note that (5) only yields a bound on the worst-case of the asymptotic of \(\frac{{\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}(\gamma )}]}{\ln (\gamma ^{-1})}\) as \(\gamma \searrow 0\), whereas our bound holds for any fixed \(\gamma\). Thus, there is actually no contradiction.

7 Online testing of WST

Guided by our findings in Sect. 6, we now focus on the testing problem (1) for \({\mathrm {WST}}\) in the framework developed in Sect. 4. Note that weak stochastic transitivity is in any case of particular interest for the ranking problem in dueling bandits, as it is both a sufficient and a necessary condition for the existence of a ranking over the arms consistent with the preference relation \({\mathbf {Q}}\), in the sense that an arm i is preferred over an arm j if and only if \(q_{i,j}\ge 1/2\).

[Algorithm 1: pseudocode of the naive testing procedure \(\mathcal {A}_{\mathrm {naive}}\)]

A first naïve approach for a testing component for the passive scenario (cf. Section 1) is Algorithm 1, which does the following: Terminate as soon as we can decide, for every \((i,j)\in (m)_{2}\), each with error probability at most \(\gamma ' = \min \{\alpha ,\beta \}\left( {\begin{array}{c}m\\ 2\end{array}}\right) ^{-1}\), whether \(q_{i,j}>1/2\) or \(q_{i,j}<1/2\) holds, and output \({\mathrm {WST}}\) if an auxiliary relation \(\mathbf {Q'}\) generated during runtime is \({\mathrm {WST}}\), and \(\lnot {\mathrm {WST}}\) otherwise. To construct \(\mathbf {Q'}\), the value \(q'_{i,j}\) is set to 1 resp. 0 whenever we are sure enough (for the first time) that \(q_{i,j}>1/2\) resp. \(q_{i,j}<1/2\) holds. Here, testing the sign of \(q_{i,j}-1/2\) with confidence level \(\gamma\) may be done by stopping as soon as \({({\varvec{ w }}_{t})_{i,j}}/{({\varvec{ n }}_{t})_{i,j}}\) leaves the interval \([1/2-C(({\varvec{ n }}_{t})_{i,j}),1/2 + C(({\varvec{ n }}_{t})_{i,j})],\) where \(C(\cdot )\) is an appropriate any-time confidence bound for \({({\varvec{ w }}_{t})_{i,j}}/{({\varvec{ n }}_{t})_{i,j}}.\) The term appropriate is specified in Definition 7.1 below.

In the initialization step of \(\mathcal {A}_{\mathrm {naive}}\), we inform the algorithm about how often every item i has already been compared to every other item j before the start, denoted by \(({\varvec{ n }}_{0})_{i,j}\), and how often i has won against j, denoted by \(({\varvec{ w }}_{0})_{i,j}\). Our setting allows us to assume that \(({\varvec{ w }}_{0})_{i,j} \sim {\mathrm {Bin}( ({\varvec{ n }}_{0})_{i,j} , q_{i,j} )}\) for all \(1\le i<j\le m\). As the theoretical results do not depend on the explicit choice of \({\varvec{ n }}_{0}\) and \({\varvec{ w }}_{0}\), we assume w.l.o.g. that \(({\varvec{ n }}_{0})_{i,j}=1\) for all distinct \(i,j\in [m]\) throughout the paper.
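Putting the pieces together, a minimal Python sketch of the naive procedure just described may look as follows (our own illustration, not the paper's pseudocode; pi, C, and duel are assumed interfaces, is_transitive refers to the sketch in Sect. 3.2, and for simplicity we start from empty counts instead of \(({\varvec{ n }}_{0})_{i,j}=1\)):

```python
import numpy as np

def naive_wst_test(m, pi, C, duel, max_steps=10**7):
    """Sketch of the naive test (Algorithm 1 as described above).

    pi   : sampling strategy, pi(t, n, w) -> (i, j) with i != j
    C    : anytime confidence bound, C(k) -> half-width after k comparisons;
           it should be (h, gamma')-correct with gamma' = min(alpha, beta) / (m choose 2)
    duel : environment, duel(i, j) -> 1 if i wins the duel against j, else 0
    """
    n = np.zeros((m, m))
    w = np.zeros((m, m))
    decided = np.zeros((m, m), dtype=bool)   # sign of q_ij - 1/2 decided?
    Qp = np.full((m, m), 0.5)                # auxiliary relation Q'
    for t in range(1, max_steps + 1):
        i, j = pi(t, n, w)
        win = duel(i, j)
        w[i, j] += win
        w[j, i] += 1 - win
        n[i, j] += 1
        n[j, i] += 1
        if not decided[i, j]:
            p_hat = w[i, j] / n[i, j]
            if abs(p_hat - 0.5) > C(n[i, j]):
                decided[i, j] = decided[j, i] = True
                Qp[i, j] = 1.0 if p_hat > 0.5 else 0.0
                Qp[j, i] = 1.0 - Qp[i, j]
        if decided[np.triu_indices(m, 1)].all():
            return ("WST" if is_transitive(Qp, "WST") else "not WST"), t
    return None, max_steps   # no decision within the step budget
```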

Definition 7.1

For any \(p\in [0,1]\), suppose \(\{X_{n}^{(p)}\}_{n\in {\mathbb {N}}}\) to be a family of iid random variables with distribution \({\mathrm {Ber}( p )}\). We say that a function \(C:{\mathbb {N}}\,\rightarrow \,[0,\infty ]\) is \((h,\gamma )\)-correct for given \(h\in [0,1/2)\) and \(\gamma \in (0,1/2)\), if the following holds:

  1. (a)

    For any \(p\not =1/2\), the following stop** time is almost surely finite:

    $$\begin{aligned} \mathcal {N}^{(p)} :=\mathcal {N}^{(p)}(C) :=\min \left\{ n \in {\mathbb {N}}\, : \, \frac{1}{n} \sum \nolimits _{k=1}^{n} X_{k}^{(p)} \not \in [1/2-C(n),1/2+C(n)] \right\} . \end{aligned}$$
  2. (b)

    For all \(p>1/2+h\), we have

    $$\begin{aligned} {\mathbb {P}}\Big ( \frac{1}{\mathcal {N}^{(p)}} \sum \nolimits _{k=1}^{\mathcal {N}^{(p)}} X_{k}^{(p)} < 1/2 -C \left( \mathcal {N}^{(p)}\right) \Big ) \le \gamma \ , \end{aligned}$$

    and similarly for all \(p<1/2-h\),

    $$\begin{aligned} {\mathbb {P}}\Big ( \frac{1}{\mathcal {N}^{(p)}} \sum \nolimits _{k=1}^{\mathcal {N}^{(p)}} X_{k}^{(p)} > 1/2 +C\left( \mathcal {N}^{(p)} \right) \Big ) \le \gamma . \end{aligned}$$

In case \(h>0\), a first example for an \((h,\gamma )\)-correct function \(C_{h,\gamma }\) can be inferred from Hoeffding’s inequality, by means of

$$\begin{aligned} C_{h,\gamma }^{\mathrm {Hoeffding}}(n) :={\left\{ \begin{array}{ll} 1/2 ,\quad &{}\text { if } n \le \lceil h^{-2} \ln (\gamma ^{-1}) /2 \rceil \\ 0,\quad &{}\text { otherwise} \end{array}\right. } \, . \end{aligned}$$
(7)

With this, the decision whether \(q_{i,j}>1/2\) or \(q_{i,j}<1/2\) is not made in a sequential manner, but instead after exactly \(\lceil h^{-2} \ln (\gamma ^{-1}) /2 \rceil\) duels of i and j have been conducted. At the end of this section, we will introduce more sophisticated any-time confidence bounds admitting decisions in a sequential manner, and also treat the case \(h=0\).
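In code, this bound is a direct transcription of (7) (our own helper):

```python
import math

def C_hoeffding(h, gamma):
    """(h, gamma)-correct bound from Eq. (7): the confidence interval only closes
    after a fixed number N = ceil(ln(1/gamma) / (2 h^2)) of comparisons of a pair."""
    N = math.ceil(math.log(1.0 / gamma) / (2.0 * h ** 2))
    return lambda n: 0.5 if n <= N else 0.0
```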

Theorem 7.2

Let \(m\in {\mathbb {N}}_{\ge 3}\), \(\alpha ,\beta \in (0,1)\), and \(h\in (0,1/2)\) be fixed, and define \(\gamma ' :=\min \{\alpha ,\beta \}{\left( {\begin{array}{c}m\\ 2\end{array}}\right) }^{-1}\). For any \(\pi \in \varPi _{\infty }\) and \((h,\gamma ')\)-correct function C, Algorithm 1 instantiated with parameters m, \(\pi\), and C is a solution to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\).

By construction, the sample complexity of Algorithm 1 is exactly the number of iterations that are required for testing the signs of all \(q_{i,j}-1/2\), \((i,j) \in (m)_{2}\). By choosing C according to (7), testing the sign of \(q_{i,j}-1/2\) requires in any case exactly \(N:=\lceil h^{-2} \ln (\gamma ^{-1}) /2 \rceil\) iid samples governed by \({\mathrm {Ber}( q_{i,j} )}\). However, the explicit time at which a pair has been sampled at least N times highly depends on the underlying sampling strategy \(\pi ,\) so that an analysis of the sample complexity of \(\mathcal {A}_{\mathrm {naive}}\) can only be done w.r.t. the corresponding sampling strategy \(\pi\). As the testing component works in parallel to \(\pi\) in the passive setting, i.e., it has no influence on the behavior of \(\pi ,\) the minimum requirement for a test component in the passive online testing scenario seems to be consistency in terms of an a.s. finite termination time and the adherence to predefined error bounds for a general class of sampling strategies. Both requirements are met by the test underlying \(\mathcal {A}_{\mathrm {naive}}\) by Theorem 7.2 for the class \(\varPi _{\infty }\) if \(\mathcal {A}_{\mathrm {naive}}\) is instantiated with an \((h,\gamma ')\)-correct C.

Remark 7.3

In the passive online testing scenario, i.e., when the sampling strategy \(\pi\) is instantiated in a black-box fashion by some dueling bandits algorithm based on a transitivity assumption (such as those by Falahatgar et al. (2017a, 2018)), it might happen that \(\pi\) terminates before the testing algorithm has come to a decision, so that \(\pi\) is in particular no longer defined. In this case, if one is still interested in whether transitivity was fulfilled in hindsight, one may continue sampling according to the strategy \({\hat{\pi }}\), which picks each query \(\{i,j\} \in [m]_{2}\) with probability \(1/\left( {\begin{array}{c}m\\ 2\end{array}}\right)\).

Conversely, if the testing algorithm comes to a positive decision (\({\mathbf {D}}(\mathcal {A}) = \mathrm {XST}\)) although the online ranking algorithm has not yet terminated, one can simply continue the sampling strategy without the testing component.

In case of a negative decision (\({\mathbf {D}}(\mathcal {A})=\lnot \mathrm {XST}\)), the online ranking algorithm should be interrupted, as its underlying assumptions are violated.

In the active online testing scenario (cf. Section 1), on the other hand, we have the possibility to choose \(\pi\) in a favorable way and consequently analyze the sample complexity of Algorithm 1. For this purpose, we consider a sampling strategy \(\pi =\pi (m,C)\) depending on the other parameters of \(\mathcal {A}_{\mathrm {naive}},\) which focuses on the time-dependent set consisting of all pairs \(\{i,j\}\), for which it is not yet certain with confidence level \(\gamma '\) whether \(q_{i,j}>1/2\) or \(q_{i,j}<1/2\) holds. Formally, the following set is considered:

$$\begin{aligned} U_{C}(t) :=\big \{ \{i,j\} \in [m]_{2} \, \big | \, \forall t' < t \, : \, {({\varvec{ w }}_{t'})_{i,j}}/{({\varvec{ n }}_{t'})_{i,j}} \in \big [ 1/2 \pm C(({\varvec{ n }}_{t'})_{i,j}) \big ] \big \}, \quad t\in {\mathbb {N}}\, . \end{aligned}$$

At each time step t, the sampling strategy \(\pi (m,C)\) queries \(\{i,j\} \in [m]_{2}\) uniformly at random from \(U_{C}(t),\) if \(U_{C}(t)\) is non-empty, and otherwise queries \(\{i,j\} \in [m]_{2}\) uniformly at random from \([m]_{2}.\) Note that the second case (i.e., \(U_{C}(t)\) is empty) is only defined in order to ensure that \(\pi \in \varPi _{\infty },\) which in turn allows for applying Theorem 7.2. In light of this, we obtain the following corollary.
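A possible implementation of this active strategy, matching the interface of the Algorithm 1 sketch above (again our own illustration; the persistent set mirrors the fact that a pair leaves \(U_{C}(t)\) permanently once its empirical mean has left the confidence interval):

```python
import random

def pi_active(m, C):
    """Sketch of pi(m, C): sample uniformly from the still-undecided pairs U_C(t),
    and fall back to uniform sampling over [m]_2 once U_C(t) is empty."""
    frozen = set()  # pairs whose sign of q_ij - 1/2 was decided at some earlier time
    def pi(t, n, w):
        for i in range(m):
            for j in range(i + 1, m):
                if (i, j) not in frozen and n[i, j] > 0 \
                        and abs(w[i, j] / n[i, j] - 0.5) > C(n[i, j]):
                    frozen.add((i, j))
        pool = [(i, j) for i in range(m) for j in range(i + 1, m) if (i, j) not in frozen]
        if not pool:  # U_C(t) is empty: keep sampling so that pi lies in Pi_infinity
            pool = [(i, j) for i in range(m) for j in range(i + 1, m)]
        return random.choice(pool)
    return pi
```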

Corollary 7.4

Let \(m\in {\mathbb {N}}_{\ge 3}\), \(h\in (0,1/2)\), \(\alpha ,\beta \in (0,\gamma _{0})\) for some \(\gamma _{0}\in (0,1)\), and choose \(\gamma ':=\min \{\alpha ,\beta \} / \left( {\begin{array}{c}m\\ 2\end{array}}\right)\). Let \(\pi =\pi (m,C_{h,\gamma '}^{\mathrm {Hoeffding}})\) be the sampling strategy of the above type and suppose \(\mathcal {A}\) to be Algorithm 1 called with parameters m, \(\pi\), and \(C=C_{h,\gamma '}^{\mathrm {Hoeffding}}\) from (7). Then, \(\mathcal {A}\) solves \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\) and fulfills

$$\begin{aligned} T^{\mathcal {A}} = \left( {\begin{array}{c}m\\ 2\end{array}}\right) \left\lceil \frac{h^{-2}}{2} \ln \left( \frac{m(m-1)}{2\min \{\alpha ,\beta \}}\right) \right\rceil \quad {\mathbb {P}}_{{\mathbf {Q}}}\text {-almost surely for all } {\mathbf {Q}} \in \mathcal {Q}_{m}^{h}. \end{aligned}$$

In particular, if \(\gamma :=\min \{\alpha ,\beta \}\), we have that

$$\begin{aligned} \sup \nolimits _{{\mathbf {Q}} \in \mathcal {Q}_{m}^{h}} {\mathbb {E}}_{{\mathbf {Q}}} \left[ T^{\mathcal {A}} \right] \in \mathcal {O}\left( (m^{2} \ln m) h^{-2} \ln \gamma ^{-1} \right) \end{aligned}$$

as \(\max \left\{ m, h^{-1},\gamma ^{-1} \right\} \,\rightarrow \,\infty\).

With regard to Theorem 6.1, the testing algorithm from Corollary 7.4 is already asymptotically optimal up to logarithmic factors for the \({\mathrm {WST}}\) testing problem in (1) for instances \({\mathbf {Q}} \in \mathcal {Q}_{m}^{h}\). Nevertheless, one may ask, first, whether termination is only possible once the signs of \(q_{i,j}-1/2\) are known with sufficient confidence for all \(\left( {\begin{array}{c}m\\ 2\end{array}}\right)\) pairs \(\{i,j\} \in [m]_{2},\) and second, whether the rather rough correction term \(\left( {\begin{array}{c}m\\ 2\end{array}}\right)\) in the error probability for the sign test of each \(q_{i,j}-1/2\) is optimal. In the following section, we answer both questions negatively, giving rise to more sophisticated testing procedures. Moreover, we also present a solution to \({\mathcal {P}_{\mathrm { WST}}^{m,0,\alpha ,\beta }}\) and develop instance-wise upper bounds for \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\).

We conclude this section with a discussion of further suitable anytime confidence bounds, the proofs of which are deferred to the supplement for the sake of convenience. In the following, if \(p\in [0,1]\) and \(C:{\mathbb {N}}\,\rightarrow \,{\mathbb {R}}\) are fixed, let us define \(\mathcal {N}^{(p)}(C)\) as in Definition 7.1. Inspired by the sequential probability ratio test (Wald and Wolfowitz 1948) for testing whether a coin has bias \(1/2+h\) or \(1/2-h\), we may define

$$\begin{aligned} C^{\mathrm {SPRT}}_{h,\gamma }(n) :=\frac{1}{2n} \left\lceil \frac{\ln ((1-\gamma )/\gamma )}{\ln ({(1/2 +h)}/{(1/2-h)})} \right\rceil \end{aligned}$$

for any \(h\in (0,1/2)\) and \(\gamma \in (0,1/2)\). Then, \(C^{\mathrm {SPRT}}_{h,\gamma }\) is \((h,\gamma )\)-correct and fulfills

$$\begin{aligned} \sup \nolimits _{p: |p-1/2|\ge h} {\mathbb {E}}[\mathcal {N}^{(p)}(C^{\mathrm {SPRT}}_{h,\gamma })]&= (2h)^{-1} \left\lceil \frac{\ln ((1-\gamma )/\gamma )}{\ln ({(1/2 +h)}/{(1/2-h)})} \right\rceil (1-2\gamma ). \end{aligned}$$

This is shown in Lemma A.1 in the supplement. In contrast to \(C_{h,\gamma }^{\mathrm {Hoeffding}}\), choosing \(C_{h,\gamma }^{\mathrm {SPRT}}\) leads to a sequential test, where the runtime depends on the (unknown) ground-truth p, which makes the question of instance-dependent bounds actually interesting. On the other hand, for any \(p\in (0,1)\), the random variable \(\mathcal {N}^{(p)}(C_{h,\gamma }^{\mathrm {SPRT}})\) is not bounded, i.e., there is no \(N\in {\mathbb {N}}\) such that \(\mathcal {N}^{(p)}(C_{h,\gamma }^{\mathrm {SPRT}}) \le N\) a.s. However, as we also point out in Lemma A.1, the optimality of the sequential probability ratio test assures us that choosing \(C=C^{\mathrm {SPRT}}_{h,\gamma }\) is optimal w.r.t. \({\mathbb {E}}[\mathcal {N}^{(1/2\pm h)}(C)]\).
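In code, the bound reads as follows (our own helper for \(C^{\mathrm {SPRT}}_{h,\gamma }\)):

```python
import math

def C_sprt(h, gamma):
    """SPRT-inspired (h, gamma)-correct bound: a constant threshold on the number
    of 'excess' wins, rescaled to the scale of the empirical mean."""
    k = math.ceil(math.log((1.0 - gamma) / gamma) / math.log((0.5 + h) / (0.5 - h)))
    return lambda n: k / (2.0 * n)
```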

We now turn to the more complex case of preference relations in \(\mathcal {Q}_{m}^{0}.\) In the following, we write \(\ln _{2}(\cdot ) :=\ln \ln (\cdot )\) and \(\ln _{3}(\cdot ) :=\ln \ln \ln (\cdot )\) for the sake of convenience. From a result by Farrell (1964) we can infer that, for some appropriate value \(n_{0} \in {\mathbb {N}}\) and constant \(c>0\), the function

$$\begin{aligned} C^{\mathrm {Farrell}}_{0,\gamma }(n) :={\left\{ \begin{array}{ll} {\sqrt{ \ln _{2}(n+e)+c\ln _{3}(n+e^{e})}}/{\sqrt{8n}}, \quad &{}\text {if } n\ge n_{0}+1 \\ 1/2, \quad &{}\text {otherwise} \end{array}\right. } \end{aligned}$$

is \((0,\gamma )\)-correct and fulfills

$$\begin{aligned} \lim _{h\,\rightarrow \,0} \frac{{\mathbb {E}}\left[ \mathcal {N}^{(1/2\pm h)}(C^{\mathrm {Farrell}}_{0,\gamma }) \right] }{h^{-2} \ln \ln h^{-1}} = \frac{1}{2}{\mathbb {P}}_{1/2}(\mathcal {N}^{(0)}(C_{0,\gamma }^{\mathrm {Farrell}}) =\infty ) > 0, \end{aligned}$$

which is shown in Lemma A.3 in the supplement. With the help of \(C^{\mathrm {Farrell}}_{0,\gamma }\), we will be able to present a solution \(\mathcal {A}\) to \({\mathcal {P}_{\mathrm { WST}}^{m,0,\alpha ,\beta }}\), in which the term \(h^{-2} \ln \ln h^{-1}\) will naturally appear in the sample-complexity bound (cf. Theorem 8.6). As we have seen in Theorem 6.2, the \(\ln \ln h^{-1}\)-factor may not be avoided here.
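A sketch of this bound in Python (our own helper; the constants c and n0 are the ones left unspecified above, and their admissible values depend on \(\gamma\), cf. the supplement):

```python
import math

def C_farrell(c, n0):
    """Farrell-type anytime confidence bound of the form given above; c > 0 and n0
    must be chosen appropriately for the desired error level gamma."""
    def C(n):
        if n <= n0:
            return 0.5
        inner = math.log(math.log(n + math.e)) \
                + c * math.log(math.log(math.log(n + math.e ** math.e)))
        return math.sqrt(inner) / math.sqrt(8.0 * n)
    return C
```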

8 Enhanced online WST testing

In this section, we will exploit the connection between graph theory and WST in order to improve the algorithm from Corollary 7.4. The main idea for improvement is the following: Suppose we wanted to test whether \({\mathbf {Q}} \in \mathcal {Q}_{3}\) is WST.

[Figure: digraph on three arms with edges \(2\,\rightarrow \,1\) and \(2\,\rightarrow \,3\)]

If we are sure enough that \(q_{2,1},q_{2,3}> 1/2\) holds (depicted by the edges \(2\,\rightarrow \,1\), \(2\,\rightarrow \,3\) in the picture above), then we can infer that \({\mathbf {Q}}\) is WST, since the definition of weak stochastic transitivity is fulfilled in both cases (\(q_{1,3}<1/2\) and \(q_{1,3}>1/2\)). Thus, testing \(q_{1,3}\) is in some sense superfluous. To generalize this kind of reasoning to the case \(m>3\), we first introduce a graph-theoretical interpretation of the problem.

8.1 Graph-theoretical considerations

Throughout this section, we let \(G=([m],E_{G})\) be some directed graph (digraph) on [m], i.e., \(E_{G}\subseteq [m]\times [m]\) and whenever \((i,j) \in E_{G}\) holds then \((j,i) \not \in E_{G}\), and we write \(\mathcal {G}_{m}\) for the set of all such digraphs. We call G a tournament (or complete digraph), if for all distinct \(i,j\in [m]\) either \((i,j) \in E_{G}\) or \((j,i) \in E_{G}\) holds. A graph \(G\in \mathcal {G}_{m}\) is called acyclic if it does not contain any cycle.

Note that, for every \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\) and every distinct \(i,j\in [m]\), either \(q_{i,j}>1/2\) or \(q_{j,i}>1/2\) holds. Hence, each \({\mathbf {Q}}\in \mathcal {Q}_{m}^{0}\) can be identified by a tournament \(G_{{\mathbf {Q}}} :=G = ([m],E_{G})\) with \(E_G :=\big \{ (i,j) \in [m]\times [m]\, | \, i\not = j \text { and } q_{i,j}>1/2 \big \}.\) It can be shown that \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\) is WST iff the corresponding identifying tournament \(G_{{\mathbf {Q}}}\) is acyclic (Proposition D.2).
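This correspondence is easy to operationalize: the following sketch builds the identifying tournament of a relation (given as a numpy array, as in the Sect. 3.2 sketch) and checks acyclicity with Kahn's algorithm; the helpers are our own, but by Proposition D.2 the acyclicity check is equivalent to checking \({\mathrm {WST}}\) for \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\).

```python
def identifying_tournament(Q):
    """Edge set of G_Q for a relation Q in Q_m^0: (i, j) is an edge iff q_ij > 1/2."""
    m = Q.shape[0]
    return {(i, j) for i in range(m) for j in range(m) if i != j and Q[i, j] > 0.5}

def is_acyclic(m, E):
    """Kahn's algorithm: a digraph on [m] is acyclic iff all nodes can be removed
    by repeatedly deleting nodes of in-degree zero."""
    indeg = [sum(1 for (_, b) in E if b == v) for v in range(m)]
    queue = [v for v in range(m) if indeg[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for (a, b) in E:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return removed == m
```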

In the toy example above, note that the identifying tournament of \({\mathbf {Q}}\) is acyclic in any case, i.e., regardless of whether \(q_{1,3}<\frac{1}{2}\) or \(q_{1,3} > \frac{1}{2}\) holds, making one edge of the identifying tournament superfluous for inferring \({\mathrm {WST}}\) of \({\mathbf {Q}}\) and allowing a correct decision based merely on the digraph given by \(2\,\rightarrow \,1\), \(2\,\rightarrow \,3.\) The following two definitions generalize the idea of superfluous edges to general digraphs.

Definition 8.1

A digraph G is called transitive in expansion if each of its extensions to a tournament is acyclic. In other words, no tournament \({\tilde{G}}\) on [m] with \(E_{G} \subseteq E_{{\tilde{G}}}\) contains any cycle.

Definition 8.2

Let \(G\in \mathcal {G}_{m}\). We call a pair \(\{i,j\}\in [m]_{2}\) negligible for G if for every \(k\in [m] \setminus \{i,j\}\) either \((i,k),(j,k) \in E_{G}\) or \((k,i),(k,j) \in E_{G}\) holds.

In view of Proposition D.2, we write \(\mathcal {G}_{m}({\mathrm {WST}})\) for the set of all digraphs G on [m] that are transitive in expansion. The following result provides a link between transitivity in expansion and the notion of negligibility.

Proposition 8.3

Let \(G\in \mathcal {G}_{m}\). If G does not contain a cycle and every \(\{i,j\} \in [m]_{2}\) with \((i,j),(j,i)\not \in E_{G}\) is negligible for G,  then \(G \in \mathcal {G}_{m}({\mathrm {WST}})\) holds.
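In code, Definition 8.2 and the sufficient condition of Proposition 8.3 can be checked as follows (our own helpers, re-using is_acyclic from the sketch above):

```python
def is_negligible(i, j, E, m):
    """Definition 8.2: {i, j} is negligible for ([m], E) if every other arm k is
    either beaten by both i and j or beats both i and j."""
    for k in range(m):
        if k in (i, j):
            continue
        if not (((i, k) in E and (j, k) in E) or ((k, i) in E and (k, j) in E)):
            return False
    return True

def certainly_transitive_in_expansion(m, E):
    """Sufficient condition of Proposition 8.3: E is acyclic and every pair whose
    direction is still unknown is negligible for ([m], E)."""
    if not is_acyclic(m, E):
        return False
    return all(is_negligible(i, j, E, m)
               for i in range(m) for j in range(i + 1, m)
               if (i, j) not in E and (j, i) not in E)
```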

This result, together with the connection between preference relations and tournaments, brings us closer to answering the questions raised at the end of Sect. 7, as we show the following: If G is transitive in expansion, then there exists some graph \({{\tilde{G}}}\), which is transitive in expansion, satisfying \(E_{{\tilde{G}}} \subseteq E_{G}\) and \(|E_{{\tilde{G}}}| = \left( {\begin{array}{c}m\\ 2\end{array}}\right) - \lfloor \frac{m+1}{3} \rfloor\) (Proposition D.5), i.e., in particular we have \(|E_{G}| \ge |E_{{\tilde{G}}}| = \left( {\begin{array}{c}m\\ 2\end{array}}\right) - \lfloor \frac{m+1}{3} \rfloor\). Thus, it is possible to infer \({\mathrm {WST}}\) of \({\mathbf {Q}}\) by merely considering \(\left( {\begin{array}{c}m\\ 2\end{array}}\right) - \lfloor \frac{m+1}{3} \rfloor\) edges of the identifying tournament, while a violation of \({\mathrm {WST}}\) by \({\mathbf {Q}}\) can be confirmed if the identifying tournament contains a cycle.

8.2 Exploiting transitivity in expansion

Equipped with these insights, we suggest Algorithm 2 as a testing procedure for \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). In the next theorem, we verify that this algorithm has in fact the desired theoretical guarantees; the proof is given in Section D in the supplement.

[Algorithm 2: pseudocode of the enhanced testing procedure]

Theorem 8.4

Let \(\pi \in \varPi _{\infty }\), \(\alpha ,\beta \in (0,1)\) and \(h\in [0,1/2)\) be fixed and define \(\gamma ' :=\min \{{\alpha }/{m}, \beta (\left( {\begin{array}{c}m\\ 2\end{array}}\right) -\lfloor \frac{m+1}{3} \rfloor )^{-1}\}\). Suppose \(C:{\mathbb {N}}\,\rightarrow \,[0,\infty ]\) is \((h,\gamma ')\)-correct, and let \(\mathcal {A}\) denote Algorithm 2 called with parameters m, \(\pi\) and C. Then, \(\mathcal {A}\) solves \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). In case \(C=C_{h,\gamma '}^{\mathrm {X}}\) for \(\mathrm {X} \in \{\mathrm {Hoeffding},\mathrm {SPRT},\mathrm {Farrell}\}\) and \(\tilde{\mathcal {A}}\) is Algorithm 1 called with parameters m, \(\pi\) and \(C_{h,{\tilde{\gamma }}}^{\mathrm {X}}\) with \({\tilde{\gamma }} :={\min \{\alpha ,\beta \}}/{\left( {\begin{array}{c}m\\ 2\end{array}}\right) }\) (as suggested by Theorem 7.2), \(T^{\mathcal {A}} \le T^{\tilde{\mathcal {A}}}\) holds almost surely w.r.t. \({\mathbb {P}}_{{\mathbf {Q}}}\) for any \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\).

Lemma D.9 indicates that one cannot expect to choose a correction term smaller than \(\left( {\begin{array}{c}m\\ 2\end{array}}\right) -\lfloor \frac{m+1}{3} \rfloor\) for the desired type II error within the choice of \(\gamma\) in Algorithm 2. Furthermore, the fact that the graph \(G\in \mathcal {G}_{m}\) with edges \(1\,\rightarrow \,2 \,\rightarrow \,\dots \,\rightarrow \,m \,\rightarrow \,1\) contains a cycle, unlike any of its proper subgraphs, demonstrates optimality of the correction term m for the desired type I error within the choice of \(\gamma\). As a direct consequence of Theorem 8.4, we obtain a result analogous to the one stated in Corollary 7.4 for Algorithm 2 called with m, the sampling strategy \(\pi\) from Corollary 7.4, and \(C_{h,\gamma '}^{\mathrm {Hoeffding}}\) with \(\gamma ' = \min \{{\alpha }/{m}, \beta (\left( {\begin{array}{c}m\\ 2\end{array}}\right) -\lfloor \frac{m+1}{3} \rfloor )^{-1}\}\), so that it achieves an optimal worst-case runtime (up to a logarithmic term in m) in the active online testing scenario as well.

8.3 Instance-wise upper bounds and exploiting negligibility of edges

We conclude this section with more sophisticated solutions to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\) in the active setting, which take into account that queries \(\{i,j\}\) that are negligible with high probability are superfluous and should be avoided. To this end, we define the sampling strategy \(\pi ^{*}(m,C)\), which, similarly to the sampling strategy \(\pi (m,C)\) considered in Corollary 7.4, keeps track of a specific subset of \([m]_{2}\) consisting of all \(\{i,j\}\) for which neither \(q_{i,j} > 1/2\) nor \(q_{i,j}<1/2\) can yet be decided with enough confidence (with regard to C) at time t. In contrast to the latter, the subset used by \(\pi ^{*}(m,C)\) also takes the negligibility of edges into account. Formally, \(\pi ^{*}(m,C)\) considers the following set at time t:

$$\begin{aligned} U_{C}^{*}(t) :=\big \{ \{i,j\} \in [m]_{2} \, \big | \,&(i,j),(j,i) \not \in {\hat{E}}_{t} \text { and } \\&\{i,j\} \text { is not negligible for } ([m],{\hat{E}}_{t}) \big \}. \end{aligned}$$

The sampling procedure of \(\pi ^{*}(m,C)\) is just like that of \(\pi (m,C)\), except that \(U_{C}(t)\) is replaced by \(U_{C}^{*}(t)\). Note that \({\hat{E}}_{t}\) may be defined in terms of \({\varvec{ n }}_{0},{\varvec{ w }}_{0},\dots ,{\varvec{ n }}_{t-1},{\varvec{ w }}_{t-1}\) as the set of all \((i,j) \in [m]\times [m]\) for which some \(t'<t\) exists, such that

$$\begin{aligned} ({{\varvec{ w }}_{t'}}/{{\varvec{ n }}_{t'}})_{i,j} > {1}/{2} + C(({\varvec{ n }}_{t'})_{i,j}) \text {\ \ and \ \ } \forall t''<t': ({{\varvec{ w }}_{t''}}/{{\varvec{ n }}_{t''}})_{i,j} \in \left[ {1}/{2} \pm C(({\varvec{ n }}_{t''})_{i,j})\right] , \end{aligned}$$

whence \(\pi ^{*}(m,C)\) is in fact a sampling strategy as stipulated in Definition 3.1.
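A sketch of \(\pi ^{*}(m,C)\) along these lines (our own illustration, re-using is_negligible from Sect. 8.1 and the interface of the earlier strategy sketches):

```python
import random

def pi_star(m, C):
    """Sketch of pi*(m, C): sample uniformly from the undecided, non-negligible
    pairs U*_C(t); fall back to uniform sampling so that pi* lies in Pi_infinity."""
    E_hat = set()   # directed edges (i, j) decided so far, i.e., q_ij > 1/2 with confidence
    frozen = set()  # unordered pairs {i, j} (stored with i < j) whose sign is decided
    def pi(t, n, w):
        for i in range(m):
            for j in range(i + 1, m):
                if (i, j) in frozen or n[i, j] == 0:
                    continue
                p_hat = w[i, j] / n[i, j]
                if abs(p_hat - 0.5) > C(n[i, j]):
                    frozen.add((i, j))
                    E_hat.add((i, j) if p_hat > 0.5 else (j, i))
        pool = [(i, j) for i in range(m) for j in range(i + 1, m)
                if (i, j) not in frozen and not is_negligible(i, j, E_hat, m)]
        if not pool:
            pool = [(i, j) for i in range(m) for j in range(i + 1, m)]
        return random.choice(pool)
    return pi
```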

From Theorem 8.4, we immediately obtain that Algorithm 2 called with parameters m, \(\pi ^{*}(m,C)\) and C is a solution to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). Even though this guarantee holds for any \((h,\gamma ')\)-correct function C, it is desirable to choose C in such a way that the sample complexity of the corresponding algorithm is low. According to Lemma A.1, Lemma A.3, and Lemma A.2, the choices \(C=C_{h,\gamma '}^{\mathrm {SPRT}}\) resp. \(C=C_{0,\gamma '}^{\mathrm {Farrell}}\) are to some extent optimal in this regard for the cases \(h>0\) resp. \(h=0\). With these, we obtain the following instance-wise upper bounds on the expected termination time for solutions to \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). They show that the values \(|q_{i,j}-1/2|\) determine the complexity of testing whether \({\mathbf {Q}}\) is weakly stochastically transitive or not. In comparison to the lower bound stated in Theorem 6.1, our instance-wise upper bounds depend on all \(\left( {\begin{array}{c}m\\ 2\end{array}}\right)\) instead of only \(\left( {\begin{array}{c}m-1\\ 2\end{array}}\right)\) entries of \({\mathbf {Q}}\). Needless to say, in terms of the asymptotic behavior as \(m\,\rightarrow \,\infty\), this difference is negligible.

Theorem 8.5

Suppose \(m\in {\mathbb {N}}_{\ge 3}\), \(\alpha , \beta \in (0,1/2)\), \(h\in (0,1/2)\), and define \(\gamma ' :=\min \{{\alpha }/{m}, \beta (\left( {\begin{array}{c}m\\ 2\end{array}}\right) -\lfloor \frac{m+1}{3} \rfloor )^{-1}\}\). Let \(\mathcal {A}\) be Algorithm 2 called with parameters m, the sampling strategy \(\pi ^{*}(m,C^{\mathrm {SPRT}}_{h,\gamma '})\) and \(C=C^{\mathrm {SPRT}}_{h,\gamma '}\) as the function C. Then, \(\mathcal {A}\) solves \({\mathcal {P}_{\mathrm { WST}}^{m,h,\alpha ,\beta }}\). Suppose \({\mathbf {Q}} \in \mathcal {Q}_{m}^{h}\) is fixed and write \(h_{i,j} :=|q_{i,j}-1/2|\) for all distinct \(i,j\in [m]\). Then, with \(e(h,\gamma ') :=\left\lceil \frac{\ln ((1-\gamma ')/\gamma ')}{\ln ({(1/2 +h)}/{(1/2-h)})} \right\rceil\), we have that \({\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}}]\) is bounded from above by

$$\begin{aligned} \sum _{(i,j) \in (m)_{2}} \frac{e(h,\gamma ')}{2h_{i,j}} \left| 1-2 \left( 1+(1/2 + h_{i,j})^{e(h,\gamma ')} (1/2-h_{i,j})^{-e(h,\gamma ')} \right) ^{-1} \right| . \end{aligned}$$
(8)

By means of Lemma A.1, it immediately follows that algorithm \(\mathcal {A}\) from Theorem 8.5 fulfills \(\sup _{{\mathbf {Q}}\in \mathcal {Q}_{m}^{h}} {\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}}] \in \mathcal {O}(m^{2} \ln (m) h^{-2} \ln (\gamma ^{-1}))\) as \(\max \{m,h^{-1},\gamma ^{-1}\} \,\rightarrow \,\infty\), i.e., it is asymptotically optimal up to a \(\ln (m)\)-factor. In order to compare the result of Theorem 8.5 with the instance-wise lower bound from Theorem 6.1 more thoroughly, suppose \({\mathbf {Q}} \in \mathcal {Q}_{m}^{h}({\mathrm {WST}})\) and \((i,j) \in (m)_{2}\) with \(|\sigma _{{\mathbf {Q}}}(i)-\sigma _{{\mathbf {Q}}}(j)|>1\) to be fixed for the moment and let \(\alpha = \beta = \gamma\) for simplicity. Due to \(e(h,\gamma ') \in \varTheta (h^{-1})\) as \(h\searrow 0\), the dependency of (8) on the (i,j)-entry of \({\mathbf {Q}}\) is approximately \(h_{i,j}^{-1} h^{-1}\), whereas this dependency in (6) is of the form \(h_{i,j}^{-2}\). This suggests that the two bounds are closest in case \(h\approx h_{i,j}\). Considering that the choice \(C=C_{h,\gamma '}^{\mathrm {SPRT}}\) assures optimal early detection of \(\mathrm {sign}(q_{i,j}-1/2)\) only in case \(|q_{i,j}-1/2|=h\), the appearance of \(h^{-1}\) in (8) may not come as a surprise. Moreover, the scaling \(\gamma ' \approx \gamma /m^{2}\) leads to an additional factor of \(2\ln (m)\) in (8) compared to (6).

Theorem 8.6

Let \(m\in {\mathbb {N}}_{\ge 3}\) and \(\alpha , \beta \in (0,1/2)\) be fixed, and define \(\gamma ' :=\min \{{\alpha }/{m}, \beta (\left( {\begin{array}{c}m\\ 2\end{array}}\right) -\lfloor \frac{m+1}{3} \rfloor )^{-1}\}\). Suppose \(\mathcal {A}\) is Algorithm 2 called with parameters m, \(\pi ^{*}(m,C^{\mathrm {Farrell}}_{0,\gamma '})\) and \(C^{\mathrm {Farrell}}_{0,\gamma '}\). Then \(\mathcal {A}\) solves \({\mathcal {P}_{\mathrm { WST}}^{m,0,\alpha ,\beta }}\), and there exists some \(h_{0} \in (0,1/2)\) with the following property: If \({\mathbf {Q}} \in \mathcal {Q}_{m}^{0}\) is such that \(h_{i,j} :=|q_{i,j}-1/2| \le h_{0}\) for all distinct \(i,j\in [m]\), then

$$\begin{aligned} {\mathbb {E}}_{{\mathbf {Q}}}[T^{\mathcal {A}}] \le \frac{1}{2} \sum \nolimits _{(i,j) \in (m)_{2}} h_{i,j}^{-2} \ln \ln (h_{i,j}^{-1}). \end{aligned}$$
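Analogously, the \(h=0\) bound of Theorem 8.6 can be evaluated for a given instance. A minimal sketch, assuming all margins \(h_{i,j}\) are strictly positive and small enough (in particular below \(1/e\), so that the iterated logarithm is positive); the function name bound_farrell is ours:

```python
import itertools
import math

def bound_farrell(Q):
    """Evaluate (1/2) * sum over pairs i < j of h_ij^{-2} * ln ln(1 / h_ij)."""
    m = len(Q)
    total = 0.0
    for i, j in itertools.combinations(range(m), 2):
        h_ij = abs(Q[i][j] - 0.5)        # assumed to satisfy 0 < h_ij < 1/e
        total += h_ij ** -2 * math.log(math.log(1.0 / h_ij))
    return 0.5 * total
```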

9 Experiments

In this section, we compare the \({\mathrm {WST}}\) testing procedures from Theorems 7.2 and 8.4. Since the solution obtained by Degenne and Koolen (2019) appears infeasible in practice (Remark F.1), we do not consider it in our experiments. For the sake of simplicity, we focus on the passive testing scenario, with \(\pi \in \varPi _{\infty }\) chosen such that it selects its query at each time step uniformly at random from \([m]_{2}\). We also fix \(\alpha = \beta = 0.05\) as well as \(h=0.01\) in the following. Further, we write \(\mathcal {A}_{\mathrm {naive}}\) for Algorithm 1 instantiated with the parameters m, \(\pi\) and \(C^{\mathrm {SPRT}}_{h,\gamma '}\) with \(\gamma ' :={\min \{\alpha ,\beta \}}/{\left( {\begin{array}{c}m\\ 2\end{array}}\right) }\), and \(\mathcal {A}_{\mathrm {improved}}\) for Algorithm 2 called with parameters m, \(\pi\) and \(C^{\mathrm {SPRT}}_{h,\gamma ''}\) with \(\gamma '' :=\min \{{\alpha }/{m}, \beta (\left( {\begin{array}{c}m\\ 2\end{array}}\right) -\lfloor \frac{m+1}{3} \rfloor )^{-1}\}\). Here, we have chosen \(C^{\mathrm {SPRT}}\) as the boundary function due to its optimal behavior with respect to the expected runtime on some instances, as stated in Lemma A.1.
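For concreteness, the confidence levels \(\gamma '\) and \(\gamma ''\) as well as the uniform sampling strategy \(\pi\) used in our experiments can be written down as follows. This is a minimal Python sketch; the variable and function names (gamma_naive, gamma_improved, pi_uniform) are our own and not part of the algorithms' specification.

```python
import math
import random

m, alpha, beta, h = 6, 0.05, 0.05, 0.01

# Confidence level used by A_naive (Algorithm 1): gamma' = min{alpha, beta} / C(m, 2).
gamma_naive = min(alpha, beta) / math.comb(m, 2)

# Confidence level used by A_improved (Algorithm 2):
# gamma'' = min{alpha / m, beta / (C(m, 2) - floor((m + 1) / 3))}.
gamma_improved = min(alpha / m, beta / (math.comb(m, 2) - (m + 1) // 3))

def pi_uniform(rng=random):
    """Passive sampling strategy: a query drawn uniformly at random from [m]_2."""
    i, j = sorted(rng.sample(range(m), 2))
    return i, j
```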

In the first experiment, we investigate the termination time of \(\mathcal {A}_{\mathrm {naive}}\) and \(\mathcal {A}_{\mathrm {improved}}\) for preference relations in \(\mathcal {Q}_{m}^{0.05}({\mathrm {WST}})\) or \(\mathcal {Q}_{m}^{0.05}(\lnot {\mathrm {WST}}).\) To this end, we sample \({\mathbf {Q}}\) uniformly at random from \(\mathcal {Q}_{m}^{0.05}({\mathrm {WST}})\) (resp. \(\mathcal {Q}_{m}^{0.05}(\lnot {\mathrm {WST}})\)), run both test algorithms until termination, and repeat this process 100 times. Here, both \(\mathcal {A}_{\mathrm {naive}}\) and \(\mathcal {A}_{\mathrm {improved}}\), started with the same \({\mathbf {Q}}\), observe the same duel chosen by \(\pi\) in each time step as well as the same outcome of the duel (a sketch of this coupling is given after the table). As stated in Theorem 8.4, \(\mathcal {A}_{\mathrm {improved}}\) thus terminates at most as late as \(\mathcal {A}_{\mathrm {naive}}\) in every case. In the following table, we report the obtained average termination times (with the corresponding standard errors in parentheses) for varying values of m.

 

$$\begin{aligned}
\begin{array}{l|cc|cc}
 & \multicolumn{2}{c|}{{\mathrm {WST}}} & \multicolumn{2}{c}{\lnot {\mathrm {WST}}} \\
 & \mathcal {A}_{\mathrm {naive}} & \mathcal {A}_{\mathrm {improved}} & \mathcal {A}_{\mathrm {naive}} & \mathcal {A}_{\mathrm {improved}} \\
\hline
m=4 & 5540 \, (329.3) & \mathbf {2936} \, (245.5) & 5273 \, (325.4) & \mathbf {3468} \, (315.6) \\
m=5 & 11{,}670 \, (601.7) & \mathbf {9862} \, (596.3) & 12{,}041 \, (581.5) & \mathbf {4380} \, (367.1) \\
m=6 & 20{,}420 \, (789.1) & \mathbf {17{,}951} \, (810.7) & 20{,}374 \, (921.3) & \mathbf {4903} \, (235.6) \\
m=7 & 36{,}149 \, (1403.7) & \mathbf {32{,}429} \, (1408.6) & 35{,}261 \, (1535.9) & \mathbf {6203} \, (342.1) \\
m=8 & 52{,}214 \, (2050.0) & \mathbf {48{,}216} \, (2009.0) & 55{,}727 \, (1910.8) & \mathbf {7066} \, (191.6)
\end{array}
\end{aligned}$$
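The coupling described above, i.e., feeding \(\mathcal {A}_{\mathrm {naive}}\) and \(\mathcal {A}_{\mathrm {improved}}\) exactly the same sequence of duels and outcomes, can be sketched as follows. This is a minimal Python sketch; the generator name duel_stream and the representation of \({\mathbf {Q}}\) as a matrix of winning probabilities are our own choices.

```python
import random

def duel_stream(Q, pi, rng=random):
    """Common stream of (pair, outcome) observations consumed by both testers.

    pi is any sampling strategy returning a pair (i, j), e.g., the uniform
    strategy sketched above; the outcome is a Bernoulli draw with success
    probability q_ij.  Because A_naive and A_improved read the same stream,
    any difference in termination time is due to the test logic alone.
    """
    while True:
        i, j = pi(rng)
        yield (i, j), (rng.random() < Q[i][j])   # True iff arm i wins the duel against arm j
```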

The results reveal that \(\mathcal {A}_{\mathrm {improved}}\) needs significantly fewer samples than \(\mathcal {A}_{\mathrm {naive}}\) for checking WST throughout, and the effect is strongest if \({\mathbf {Q}}\) is not WST and m is large. In particular, if the underlying preference relation is not WST, the termination time of \(\mathcal {A}_{\mathrm {improved}}\) is mostly decreasing with the number of available arms, while the termination time of \(\mathcal {A}_{\mathrm {naive}}\), on the other hand, increases rapidly with the number of arms. Moreover, neither test algorithm made any error in deciding whether WST holds for the underlying preference relation \({\mathbf {Q}}\), i.e., the observed accuracy of both test algorithms was \(100 \%\) throughout. Last but not least, it is worth mentioning that \(\mathcal {A}_{\mathrm {improved}}\) (as well as \(\mathcal {A}_{\mathrm {naive}}\)) terminates in each problem scenario much earlier than the derived worst-case upper bound \((2h)^{-1} \left\lceil \frac{\ln ((1-\gamma '')/\gamma '')}{\ln ({(1/2 +h)}/{(1/2-h)})} \right\rceil (1-2\gamma '')\left( {\begin{array}{c}m\\ 2\end{array}}\right)\), which is \(\ge 4370 \left( {\begin{array}{c}m\\ 2\end{array}}\right)\) for any \(m\ge 3\) (cf. Theorem 8.5).
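As a quick plausibility check of the last figure, the per-pair factor \((2h)^{-1} \left\lceil \frac{\ln ((1-\gamma '')/\gamma '')}{\ln ({(1/2 +h)}/{(1/2-h)})} \right\rceil (1-2\gamma '')\) can be evaluated numerically for the values of m used in the experiment; a minimal sketch (the variable names are ours):

```python
import math

alpha = beta = 0.05
h = 0.01
for m in range(3, 9):
    gamma = min(alpha / m, beta / (math.comb(m, 2) - (m + 1) // 3))   # gamma''
    e = math.ceil(math.log((1 - gamma) / gamma) / math.log((0.5 + h) / (0.5 - h)))
    per_pair = e * (1 - 2 * gamma) / (2 * h)
    print(m, round(per_pair))   # every printed value is >= 4370, consistent with the figure above
```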

Next, we analyze the impact of the degree of violation of WST within a preference relation \({\mathbf {Q}}\), measured by the number of cycles in the identifying tournament \(G_{{\mathbf {Q}}}\), on the sample complexities of \(\mathcal {A}_{\mathrm {naive}}\) and \(\mathcal {A}_{\mathrm {improved}}\), respectively. For this purpose, we choose \({\mathbf {Q}}_{1}\), \({\mathbf {Q}}_{2}\), \({\mathbf {Q}}_{3}\) and \({\mathbf {Q}}_{4}\) as

$$\begin{aligned} {\small \begin{pmatrix} - & x & x & x & x & x \\ & - & x & x & x & x \\ & & - & x & x & x \\ & & & - & x & x \\ & & & & - & x \\ & & & & & - \end{pmatrix}, \begin{pmatrix} - & x & y & x & x & x \\ & - & x & x & x & x \\ & & - & x & x & x \\ & & & - & x & x \\ & & & & - & x \\ & & & & & - \end{pmatrix}, \begin{pmatrix} - & x & y & x & y & x \\ & - & x & y & x & x \\ & & - & x & x & x \\ & & & - & x & x \\ & & & & - & x \\ & & & & & - \end{pmatrix} \text { and } \begin{pmatrix} - & x & y & x & y & x \\ & - & x & y & x & x \\ & & - & x & x & y \\ & & & - & x & x \\ & & & & - & x \\ & & & & & - \end{pmatrix},} \end{aligned}$$

respectively, where \(x:=0.6\) and \(y:=0.4\). The following table shows the number of cycles in \(G_{{\mathbf {Q}}_{i}}\) together with the average runtimes (with the empirical standard errors in parentheses) of \(\mathcal {A}_{\mathrm {naive}}\) and \(\mathcal {A}_{\mathrm {improved}}\) when started with \({\mathbf {Q}}_{i}\), over 100 runs. We also report the average elapsed time \(T_{\mathrm {elapsed}}\) (in seconds) per run as an indicator of the computational costs of \(\mathcal {A}_{\mathrm {naive}}\) and \(\mathcal {A}_{\mathrm {improved}}\). All experiments were run on a single CPU.

  

$$\begin{aligned}
\begin{array}{cc|cc|cc}
 & & \multicolumn{2}{c|}{\mathcal {A}_{\mathrm {naive}}} & \multicolumn{2}{c}{\mathcal {A}_{\mathrm {improved}}} \\
i & \# \text { cycles in } G_{{\mathbf {Q}}_{i}} & T^{\mathcal {A}} & T_{\mathrm {elapsed}} & T^{\mathcal {A}} & T_{\mathrm {elapsed}} \\
\hline
1 & 0 & 25{,}919 \, (332.3) & 0.60 & \mathbf {25{,}639} \, (340.8) & 2.16 \\
2 & 1 & 25{,}170 \, (296.4) & 0.58 & \mathbf {10{,}609} \, (187.4) & 0.44 \\
3 & 9 & 25{,}599 \, (366.1) & 0.60 & \mathbf {8988} \, (110.3) & 0.31 \\
4 & 28 & 26{,}014 \, (355.7) & 0.60 & \mathbf {9063} \, (110.7) & 0.31
\end{array}
\end{aligned}$$
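The cycle counts reported above can be reproduced by enumerating the simple directed cycles of the tournaments \(G_{{\mathbf {Q}}_{i}}\). The following is a minimal Python sketch for \({\mathbf {Q}}_{2}\), assuming that the cycle count indeed refers to simple directed cycles of the tournament (which matches the reported values for these instances) and using numpy and networkx purely for illustration.

```python
import numpy as np
import networkx as nx

# Q_2 from above with x = 0.6 and y = 0.4 (only the entry q_{1,3} is reversed).
x, y = 0.6, 0.4
m = 6
Q = np.full((m, m), 0.5)
for i in range(m):
    for j in range(i + 1, m):
        Q[i, j], Q[j, i] = x, 1.0 - x
Q[0, 2], Q[2, 0] = y, 1.0 - y     # zero-based indices for entry (1, 3) of the matrix above

# Tournament G_Q: edge i -> j iff q_ij > 1/2; count its simple directed cycles.
G = nx.DiGraph((i, j) for i in range(m) for j in range(m) if Q[i, j] > 0.5)
print(sum(1 for _ in nx.simple_cycles(G)))   # prints 1, matching the table row for i = 2
```

The same construction, with the additional reversed entries of \({\mathbf {Q}}_{3}\) and \({\mathbf {Q}}_{4}\), can be used to reproduce the counts for those instances as well.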

These results support the following conclusions. Firstly, the larger the number of cycles in the identifying tournament \(G_{{\mathbf {Q}}_{i}}\) of the underlying preference relation \({\mathbf {Q}}_{i}\) (i.e., the more severely the WST property is violated), the lower the sample complexity of \(\mathcal {A}_{\mathrm {improved}}\) is on average. Secondly, this effect exhibits an “elbow” shape: the termination time drops sharply as soon as at least one cycle is present, while additional cycles yield only a small further decrease. Thirdly, \(\mathcal {A}_{\mathrm {naive}}\) does not seem to benefit from stronger violations of WST; in fact, it does not exploit structural properties of the current estimated preference relation for an early termination, as \(\mathcal {A}_{\mathrm {improved}}\) does. Finally, the results for \({\mathbf {Q}}_{1}\) with regard to the average elapsed time demonstrate that checking the transitive in expansion property of the internal graph (i.e., line 7 in Algorithm 2) increases the computational cost per iteration step by a factor of \(\approx \frac{2.16}{25639} \cdot \frac{25919}{0.6} \approx 3.64\) compared to \(\mathcal {A}_{\mathrm {naive}}\). However, the superiority of \(\mathcal {A}_{\mathrm {improved}}\) over \(\mathcal {A}_{\mathrm {naive}}\) in terms of sample complexity is so strong that it outperforms \(\mathcal {A}_{\mathrm {naive}}\) even with regard to computational costs on \({\mathbf {Q}}_{2}\), \({\mathbf {Q}}_{3}\) and \({\mathbf {Q}}_{4}\).

In summary, the experiments empirically confirm our theoretical results on the superiority of the enhanced testing algorithm \(\mathcal {A}_{\mathrm {improved}}\) compared to \(\mathcal {A}_{\mathrm {naive}}.\)

10 Conclusion

In this paper, we have analyzed the problem of testing stochastic transitivity assumptions within the dueling bandits framework. For various types of stochastic transitivity, we provided instance-dependent lower bounds on the expected number of samples needed by any sequential test to come to a test decision obeying predefined error bounds. These results indicate that testing a stochastic transitivity assumption stronger than weak stochastic transitivity is hopeless in worst-case scenarios.

In light of these results, we have introduced a flexible algorithmic framework, which allows one either to monitor the validity of the weak stochastic transitivity assumption made by a dueling bandits algorithm during its sampling process in a passive way, or to actively query pairs of arms in order to confirm or refute this assumption as quickly as possible. To this end, we designed a sequential testing method within the algorithmic framework and provided theoretical guarantees for its type I and type II errors, as well as an almost surely finite termination time within the passive testing scenario, provided it is instantiated with an appropriate function to measure the confidence of pairwise probability estimates. In addition, we have given examples of appropriate confidence functions and have shown optimality of the resulting algorithm, up to a logarithmic factor in terms of the expected runtime, for a suitable sampling strategy that actively supports the test component. Finally, we enhanced the testing method by incorporating graph-theoretical considerations, resulting in faster decisions on the validity or violation of WST, and provided instance-dependent upper bounds on the expected runtime of this testing procedure.

Based on our findings, it would be of interest to transfer the ideas for WST testing developed in this paper to weaker yet still practically relevant assumptions in the realm of dueling bandits, such as the existence of a Condorcet winner. Furthermore, a more thorough experimental study of the suggested algorithmic framework would be important to gain more insights into the actual degree of support provided by the testing component to already established sampling strategies for ranking problems.