1 Introduction

Consider a supervised data set containing p explanatory variables for each of n individuals, \(X \in \mathbb {R}^{p}\), and a corresponding outcome, Y, which may be either discrete or continuous. These data may be high-dimensional, in the sense that p may be larger than n. An interesting problem is to discover which variables in X provide non-redundant information about the value of Y: this knowledge represents a first step towards understanding possible causal models linking X and Y and provides the building blocks for robust prediction rules. This variable selection goal has recently been expressed in terms of multiple testing, by focusing on the following null hypotheses of conditional independence (Candès et al. 2018) for all j ∈{1,…,p}:

$$ H_{j}: Y \perp\!\!\!\perp X_{j} \mid X_{-j}, $$
(1)

where \(X_{-j}\) indicates all variables in X except for the j-th one.

The method of knockoffs, introduced by Barber and Candès (2015) for low-dimensional linear regression and later extended by Candès et al. (2018) to the more general setting considered in this paper, allows one to test the hypotheses in Eq. 1 and provably controls the false discovery rate (FDR), i.e., the expected proportion of rejected hypotheses that are true nulls (Benjamini and Hochberg, 1995). A distinctive advantage of conditional testing with knockoffs is that it is fully non-parametric, requiring no modeling assumptions about the conditional distribution of Y given X, namely \(P_{Y \mid X}\), which is possibly complex and generally unknown. Instead, knockoffs treat the explanatory variables as random and model their joint distribution, \(P_{X}\). This approach is quite natural in some applications, such as a genome-wide association study (GWAS), where the goal is to discover which genetic variants, among the hundreds of thousands measured in modern studies, are useful to explain the inheritance of a polygenic trait (e.g., blood pressure) or disease (e.g., diabetes). In this context, knockoffs leverage well-established models for the transmission of genotypes from parents to offspring (Sesia et al. 2019) that provide a distribution \(P_{X}\), without having to make assumptions about the relation between genotypes and phenotypes, which is typically unknown. While conditional testing with knockoffs leads to interpretable and meaningful discoveries in GWAS and other applications, the power of this methodology is naturally limited by the size of the sample at hand. In many contexts, however, scientists have access to related data sets that can be leveraged to increase power. In genetics, for example, studies involving individuals of African descent are still limited in number and size (Popejoy and Fullerton, 2016), but large collections of genotypes and phenotypes for subjects of European ancestry are available (Bycroft et al. 2018) and can provide relevant information.
Building upon prior work on knockoffs, this paper presents and compares different transfer learning methods for testing the conditional hypotheses Hj in Eq. 1 as powerfully as possible while leveraging prior knowledge from other data sets, which may be relevant but do not correspond exactly to the population of interest. Although these methods are generally applicable, this paper focuses in particular on their implications for the analysis of GWAS data, where transferability across populations with different ancestries is of particular scientific and societal concern (Duncan et al. 2019), as we shall discuss after stating our statistical challenge more formally.

1.1 Problem Statement

Imagine that data were collected from many environments (specific studies, experimental settings, or distinct sub-populations) e ∈{0,1,…,E}, for some \(E \in \mathbb {N}\), each corresponding to a particular distribution \(P^{e}_{XY} = {P^{e}_{X}} \cdot P^{e}_{Y \mid X}\), where \({P^{e}_{X}}\) denotes the joint distribution of the explanatory variables and \(P^{e}_{Y \mid X}\) is the conditional distribution of the outcome. The object of inference is \(P^{e}_{Y \mid X}\) in the target environment e = 0. That is, we wish to test the conditional hypotheses Hj in Eq. 1 corresponding to the unknown \(P^{e}_{Y \mid X}\) for e = 0; we will refer to this specific null as \({H_{j}^{0}}\). Even though both \(P^{e}_{Y \mid X}\) and the definition of the X,Y variables may differ across environments, we are interested in situations where one has reason to believe that there might be commonalities across the \(P^{e}_{Y \mid X}\), so that it makes sense to attempt learning from others. It is useful to focus on a specific instance of the problem above and contrast the goal of our analysis with those of other existing approaches. Let us begin by assuming the definitions of the X and Y variables are the same in all environments, so that there is a clear correspondence between the various \({H_{j}^{e}}\) for all e. Our goal is to leverage all available information (transfer learning) to test \({H_{j}^{0}}\): this makes sense when one believes that the environment e = 0 is sufficiently distinct from the others and of specific scientific interest. By contrast, in other contexts one may want instead to test the specific hypotheses \({H_{j}^{e}}\) separately environment-by-environment using the original method of Candès et al.
(2018), or the global null \({\mathscr{H}}^{\text {meta}}_{j} : \cap _{e=0}^{E} {H_{j}^{e}}\), as in meta-analysis studies, or the composite null \({\mathscr{H}}^{\text {comp}}_{j} : \exists e \in \{1,\ldots , E\} \text { such that the null } {H^{e}_{j}}~\text {is true} \), such as in Li et al. (2021), or softer partial conjunction (Benjamini and Heller, 2008) versions of the latter. The transfer learning problem considered in this paper is non-trivial; for example, the standard knockoffs methodology of Candès et al. (2018) applied to the pooled data from all environments would not be a fully satisfactory solution to test \({H_{j}^{0}}\), although it is a valid and reasonable test of \({\mathscr{H}}^{\text {meta}}_{j}\). Indeed, \(P^{e}_{Y \mid X}\) may vary across environments, and in that case the above naive approach would be invalid because we seek inferences about \(\smash { P^{0}_{Y \mid X}} \), not about some mixture distribution. Further, even if \(\smash {P^{e}_{Y \mid X}}\) were to identify the same non-null variables for all e ∈{0,…,E}, in which case pooling would provide a valid test of \({H_{j}^{0}}\) within the target environment, there may be at least two compelling motivations to search for alternative methods of analysis. First, one may have direct access only to the observations from the target environment, with the information from the other data sets compressed in the form of summary statistics, due to privacy or computing constraints, for example. Second, covariate shifts (changes in \(\smash {{P^{e}_{X}}}\)) may render some discoveries more or less relevant in different environments. Concretely, imagine a variable Xj that in theory has the same effect on Y according to all \(\smash { P^{e}_{Y \mid X}} \), but in practice is always (nearly) constant for all observations from the target environment.
If Xj varies appreciably in the other environments (which have different \(\smash {{P^{e}_{X}}}\)), we should expect to discover it through pooling. However, this finding would not be very useful if our focus is to predict Y from X for future observations from the target environment.

1.2 Transferability in GWAS

While the transfer learning methods studied in this paper are widely applicable, our work is particularly motivated by a specific problem in GWAS. The problem arises from the fact that most of the genotype and phenotype data collected so far and available for general analysis involve individuals of European descent, either living in Europe or in the United States. This introduces some biases and represents a challenge to the development of clinically useful prediction tools based on genetic information (Popejoy and Fullerton, 2016; Sirugo et al. 2019). It is therefore important to analyze the relatively scarce data available from diverse populations in the most efficient way, possibly leveraging, or “transferring”, some of the information already gathered from Europeans in order to discover the most relevant genetic variants for the minority populations. To avoid misunderstanding, and to clarify what assumptions might be appropriate in this context, it is useful to provide some additional background. As a consequence of the processes by which human populations evolved, many genetic variants have different allele frequencies in groups with different ancestries, and the dependency patterns within one chromosome (linkage disequilibrium) also vary across populations (Laan and Pääbo, 1997). For example, a variant might have originated five generations ago in an individual living in Finland: in this case we would expect to see the alternative allele only in this person’s descendants, or at least predominantly so as new mutations are always possible. Or it might be that an allele was common in the handful of individuals who left Africa to populate a region in Asia: this allele will now have higher frequency among individuals who originate from that region, and it will be associated with alleles at neighboring variants that were present in the founder groups. 
By contrast, the same allele might have a different set of associated neighboring alleles in populations descended from other lineages. Differences in allele frequencies and in linkage disequilibrium patterns are such that the genotypes of each individual carry distinctive signatures of the human populations to which they belong (Rosenberg et al. 2002). This must be kept in mind by geneticists in order to avoid an excessive number of false positives while analyzing GWAS data (Price et al. 2006), and it has in part motivated the use of ethnically homogeneous samples. However, as the field of statistical genetics progresses and more associated variants are discovered, the bias towards the analysis of data from individuals of European descent gives rise to more issues. In particular, it has been observed that the genetic markers with discovered associations tend to have better predictive power and association strength in Europeans (Wray et al. 2013; Martin et al. 2017; Duncan et al. 2019) compared to other populations, to the point that it has been suggested that their use in clinical practice might be inequitable (Martin et al. 2019). In response to this problem, there have been several recent efforts to collect GWAS data from more diverse populations (Wojcik et al. 2019), and to develop new statistical methodologies that incorporate them effectively (Coram et al. 2017). The importance of purposefully collecting data from individuals who identify as members of different ethnicities (Popejoy and Fullerton, 2016; Reich, 2018), as well as the appropriateness of developing population-specific models for disease risk prediction, have led to some misinterpretations and controversy (Holmes, 2018). Note that the term “population” is used here loosely to indicate a group of people with common ancestry that mates prevalently internally and shares a common environment and lifestyle.
Of course, populations are becoming much more intermixed in our modern society, but some of the meaningful variations associated with their historical roots persist. The medical relevance of genetic variation can be understood by imagining that the same biological processes are responsible for translating genotypes to phenotypes for all humans, irrespective of ancestry. However, while attempting to reconstruct this underlying causal model through statistical analyses, and to understand the roles played within it by different genetic variants, we are faced with particular difficulties due to the variability in allele frequency, linkage disequilibrium, and environmental exposures that characterizes human populations. For example, a causal variant might occur only rarely in a particular population, or not at all. Further, a causal variant might affect the phenotype by interacting with other non-genetic variables, e.g., consumption of gluten. The level of gluten intake in different populations can be quite diverse, and so one should expect to observe different effect sizes for the aforementioned genetic variant in models of the phenotype that do not account for gluten consumption. Finally, a causal variant may not be directly measured in the data set at hand, and the genotyped proxies which best capture its underlying signal may differ across populations depending on their patterns of linkage disequilibrium. The above considerations suggest that one should carefully take into account the ancestry of each individual while analyzing GWAS data from heterogeneous populations. Indeed, the importance of population-specific approaches has been underscored in a number of papers demonstrating that models fitted on data from European samples do not accurately predict the phenotypes of individuals with other ethnicities (Wray et al. 2013; Martin et al. 2017; Duncan et al. 2019).
At the same time, given the scarcity of data on individuals of non-European descent, and based on the principle that the fundamental underlying biology is the same for all humans, it is also important to use models that can leverage the information available in European samples when analyzing data from other populations, and this is precisely the goal of “transfer learning.”

1.3 Related Work

While Bayesian inference presents a natural framework to leverage side information, we are here interested in working within the frequentist framework, and specifically through hypothesis testing. Previous work has illustrated how to incorporate side information within the context of hypothesis testing with FDR control, mostly focusing on settings in which p-values are available. For example, Genovese et al. (2006) developed a weighted variation of the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) such that larger prior weights make it easier for the corresponding hypotheses to be rejected. Later works presented different ways of determining the weights, either assuming the side information is independent of the p-values (Roquain and Van De Wiel, 2009), or allowing for data-dependent weighting (Hu et al. 2010; Zhao and Zhang, 2014; Ignatiadis et al. 2016; Durand, 2019). The weighted BH procedure was further extended by Ignatiadis and Huber (2021), who considered learning data-adaptive p-value weights using cross-weighting, and by Lei and Fithian (2018), who trained a sequence of p-value thresholds adaptively and iteratively. If valid p-values are not available, as is the case for conditional testing with GWAS data (Sesia et al. 2020), the above methods cannot be applied. To address this challenge, Ren and Candès (2022) developed an adaptive knockoff filter, extending the original knockoff filter of Barber and Candès (2015) and Candès et al. (2018) in order to increase its power by leveraging side information. This paper applies the method of Ren and Candès (2022) to analyze GWAS data from individuals with different ancestries (Sesia et al. 2021), comparing its performance to that of a novel alternative approach.
The main difference between the adaptive knockoff filter of Ren and Candès (2022) and the novel method proposed here in Section 2.3 is that the latter directly leverages side information while analyzing the raw genotype-phenotype data, whereas the former operates on pre-computed knockoff statistics. Our solution is made possible by the general flexibility of the knockoffs framework (Candès et al. 2018), which allows one to control the FDR utilizing any test statistics. This work is also partially inspired by Li et al. (2021), which studied the related problem of testing for robust associations that are consistent across many environments.

The idea of leveraging side information from external data sets is common in the context of predictive inference, where it is typically described as “transfer learning” (Heckman, 1979; Pan and Yang, 2009). While this paper focuses on testing rather than prediction, these problems are closely related and the methods described here could naturally be applied to construct predictive models, for example selecting which genetic markers should be utilized to compute more efficient polygenic risk scores leveraging the information contained in GWAS data from different populations.

2 Methods

2.1 Review: Knockoffs

Knockoffs provide a method to analyze data from one environment e and test the conditional hypothesis \({H_{j}^{e}}\) (1), controlling the FDR over all variables j ∈{1,…,p}. In the model-X setting we consider (Candès et al. 2018), the first step of this procedure consists of generating synthetic variables \(\tilde {X}^{e} = (\tilde {X}_{1}^{e}, \dots , \tilde {X}_{p}^{e})\) that imitate the distribution of the original variables \(X^{e} = ({X_{1}^{e}}, \dots , {X_{p}^{e}})\) but are known to be null in the sense of \({H_{j}^{e}}\) (1). In particular, \(\tilde {X}^{e}\) is created as a function of Xe without looking at Ye, so that \(\tilde {X}^{e} \perp\!\!\!\perp Y^{e} \mid X^{e}\), and it is carefully designed to ensure that the joint distribution of \((X^{e},\tilde {X}^{e})\) remains unaltered if any variables are swapped with the corresponding knockoffs: \(\smash {(X^{e}, \tilde {X}^{e})_{\operatorname {swap}(S)} \stackrel {d}{=}(X^{e}, \tilde {X}^{e})}\) for any \(S \subset \{1, \dots , p\}\). Here, the notation \(\smash {(X^{e}, \tilde {X}^{e})_{\operatorname {swap}(S)}}\) indicates the 2p-dimensional vector obtained by swapping the elements of Xe indexed by S with the corresponding elements of \(\tilde {X}^{e}\). In practice, the construction of knockoffs requires knowledge of the joint distribution of the original variables, \({P_{X}^{e}}\), and practical algorithms have already been developed to handle many possible cases, including multivariate Gaussian distributions (Candès et al. 2018) and hidden Markov models (Sesia et al. 2019). Even if \({P_{X}^{e}}\) is completely unknown, algorithms are available to construct approximate knockoffs (Romano et al. 2020). As our work focuses on a separate aspect of the analysis, we assume henceforth that valid knockoffs are available. The second step of a knockoff analysis computes test statistics \(W^{e} \in \mathbb {R}^{p}\) that provide information on the potential dependency of Y on X.
These statistics are a function of all data \(\mathbf {X}^{e} \in \mathbb {R}^{n \times p}, \mathbf {Y}^{e} \in \mathbb {R}^{n}\) and knockoffs \(\tilde {\mathbf {X}}^{e} \in \mathbb {R}^{n \times p}\), and they encode the idea that the original variables must be significantly more predictive of \(Y^{e}\) than the knockoffs \(\tilde {X}^{e}\) in order to allow the rejection of the null \({H_{j}^{e}}\) (1). Specifically, each element of \(W^{e}\) is defined as

$$ {W^{e}_{j}}=w_{j}([\mathbf{X}^{e}, \tilde{\mathbf{X}}^{e}], \mathbf{Y}^{e}), $$
(2)

where the function wj satisfies the following flip-sign property: swapping the j-th column of Xe, namely \({\mathbf {X}_{j}^{e}}\), with \(\tilde {\mathbf {X}}_{j}^{e}\) has the only effect of changing the sign of \({W_{j}^{e}}\). A typical way of computing these statistics is to estimate a sparse linear (or generalized linear) regression model of Ye given the (standardized) predictors \([\mathbf {X}^{e}, \tilde {\mathbf {X}}^{e}]\), extract the fitted coefficients \(\hat {b}^{e} \in \mathbb {R}^{2p}\), and set \({W^{e}_{j}} = |\hat {b}_{j}^{e}| - |\hat {b}^{e}_{j+p}|\). If \({H_{j}^{e}}\) (1) is not true, one expects to see a large \(\smash {|\hat {b}_{j}^{e}|}\) but small \(\smash {|\hat {b}_{j+p}^{e}|}\), because \(\smash {\tilde {\mathbf {X}}^{e}_{j}}\) is by construction independent of Ye given the other variables; in that case, the corresponding \({W^{e}_{j}}\) would tend to be positive and large. By contrast, if \({H_{j}^{e}}\) (1) is true, \([{\mathbf {X}^{e}_{j}}, \tilde {\mathbf {X}}^{e}_{j}]\) has the same distribution as \([ \tilde {\mathbf {X}}^{e}_{j}, {\mathbf {X}^{e}_{j}}]\) conditional on Ye and on the other variables, and thus \({W^{e}_{j}}\) is equally likely to be positive or negative. Formally, the signs of the null \({W^{e}_{j}}\) are i.i.d. coin flips conditional on \((|{W^{e}_{1}}|, \dots , |{W^{e}_{p}}|)\) (Candès et al. 2018), and this allows the computation of a significance threshold guaranteeing FDR control below any desired level q ∈ (0,1). This threshold is computed by the knockoff filter (Barber and Candès, 2015) as

$$ T^{e}=\min \left\{t: \frac{1+\#\left\{j: {W^{e}_{j}} \leq-t\right\}}{\#\left\{j: {W^{e}_{j}} \geq t\right\} \vee 1} \leq q\right\}, $$
(3)

and the corresponding set of discoveries is \(\hat {\mathcal {S}} = \{j: {W^{e}_{j}} \geq T^{e}\}\), with \({\min \limits } \emptyset = \infty \).
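The two steps above, lasso-based statistics followed by the knockoff filter threshold in Eq. 3, can be sketched in Python. This is a minimal sketch assuming numpy and scikit-learn; the function names are ours and a fixed lasso penalty stands in for cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_knockoff_stats(X, X_tilde, y, alpha=0.01):
    # Fit a sparse linear model of y on [X, X_tilde] and set
    # W_j = |b_j| - |b_{j+p}|, which satisfies the flip-sign property.
    p = X.shape[1]
    b = Lasso(alpha=alpha).fit(np.hstack([X, X_tilde]), y).coef_
    return np.abs(b[:p]) - np.abs(b[p:])

def knockoff_threshold(W, q=0.1):
    # Smallest t with (1 + #{j: W_j <= -t}) / max(#{j: W_j >= t}, 1) <= q,
    # searching over the observed magnitudes; inf if no such t exists.
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf
```

The discovery set is then \(\hat{\mathcal{S}} = \{j: W_j \geq T^e\}\), e.g., `np.where(W >= knockoff_threshold(W, q))[0]`.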

An equivalent reformulation of the knockoff filter (3) which will be useful in this paper is the following. Imagine sorting the p target hypotheses in ascending order of the absolute values of the knockoff statistics We input to the filter, i.e., \(|W^{e}_{\pi _{1}}| \leq |W^{e}_{\pi _{2}}| \leq {\ldots } \leq |W^{e}_{\pi _{p}}|\), where π1,…,πp denote the permutation sorting |We| in ascending order. Then, sequentially compute

$$ \widehat{\operatorname{FDR}}(k) = \frac{1+{\sum}_{j > k} \mathbf{1} {\left\{W^{e}_{\pi_{j}}<0\right\}}}{\left( {\sum}_{j > k} \mathbf{1} {\left\{W^{e}_{\pi_{j}}>0\right\}}\right) \vee 1}, $$
(4)

for each \(k = 0,1, \dots , p-1\) until \(\widehat {\operatorname {FDR}}(k) \leq q\), and reject all \(H_{\pi _{j}}\) such that j > k and \(W^{e}_{\pi _{j}}>0\). In the case that \(\smash {\widehat {\operatorname {FDR}}(k) > q}\) for all k, no hypotheses are rejected. This formulation highlights that the role of \(|{W^{e}_{j}}|\) is to offer an informative ordering of the hypotheses, with the idea that those with positive statistics should be found at the end of this sequence in order to maximize power.
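This sequential reformulation can be sketched directly (numpy assumed; helper name ours). The optional `order` argument is the ordering π, defaulting to ascending magnitudes as described above; any alternative ordering must satisfy the sign-invariance property discussed in Section 2.3.

```python
import numpy as np

def sequential_knockoff_filter(W, q=0.1, order=None):
    # Sort hypotheses by ascending |W_j| (or use a supplied ordering pi),
    # scan k = 0, 1, ... until FDRhat(k) <= q, then reject the hypotheses
    # after position k whose statistic is positive (Eq. 4).
    p = len(W)
    order = np.argsort(np.abs(W)) if order is None else np.asarray(order)
    signs = np.sign(W[order])
    for k in range(p):
        neg = np.sum(signs[k:] < 0)  # 1-indexed positions j > k
        pos = np.sum(signs[k:] > 0)
        if (1 + neg) / max(pos, 1) <= q:
            return sorted(int(j) for j in order[k:][signs[k:] > 0])
    return []  # FDRhat(k) > q for all k: no rejections
```

On any vector W this returns the same rejections as applying the threshold form of the filter (3) with the default ordering.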

2.2 Transfer Learning with Linearly Re-ordered Knockoff Statistics

If the goal is to use knockoffs to test \({H_{j}^{e}}\) (1) for one particular environment, namely e = 0, the trivial solution is to apply the existing methodology reviewed in the previous section to the available data collected from the population of interest (Candès et al. 2018). However, if additional observations are available from other environments e ∈{1,…,E} which may share some similarity in \(P_{Y \mid X}\), one may want to incorporate that information into the analysis as efficiently as possible. To this end, consider the following simple procedure: after generating knockoffs for every e ∈{0,1,…,E}, apply the standard method from Candès et al. (2018) to the data from environment e = 0 obtaining statistics W0, and to the pooled data from the external environments e ∈{1,…,E} obtaining statistics Wext. Then, combine W0 and Wext into new statistics \(W^{\operatorname {lro}} \in \mathbb {R}^{p} \) defined such that:

$$ \operatorname{sign} (W_{j}^{\operatorname{lro}})= \operatorname{sign} ({W_{j}^{0}}). $$
(5)
$$ |W_{j}^{\operatorname{lro}}|= (1-\theta) |{W_{j}^{0}}| + \theta |W_{j}^{\operatorname{ext}}|, \quad \text{ for some fixed } \theta \in [0,1]. $$
(6)

In words, the signs of Wlro are the same as those of the naive statistics computed on the environment of interest, while their absolute values are a linear combination of |W0| and \(\smash {|W^{\operatorname {ext}}|}\). In the special case of 𝜃 = 0, we recover Wlro = W0. This procedure ensures the signs of null Wlro are i.i.d. coin flips conditional on |Wlro|, which implies the standard knockoff filter (3) can be applied to control the FDR for the hypotheses \({H_{j}^{e}}\) (1) in the environment e = 0 of interest (Candès et al. 2018).
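In code, the combination rule of Eqs. 5 and 6 amounts to a one-liner (numpy assumed; function name ours):

```python
import numpy as np

def lro_statistics(W0, W_ext, theta=0.5):
    # Eqs. (5)-(6): sign taken from the target-environment statistics W0,
    # magnitude a convex combination of |W0| and |W_ext|.
    W0, W_ext = np.asarray(W0, float), np.asarray(W_ext, float)
    return np.sign(W0) * ((1 - theta) * np.abs(W0) + theta * np.abs(W_ext))
```

The resulting Wlro can then be passed directly to the standard knockoff filter (3).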

All mathematical proofs are in Appendix A.

Proposition 2.1.

Let Wlro be the knockoff statistics for the target environment e = 0 computed with the linear re-ordering method described above, based on prior importance statistics Wext computed on data from the external environments e ∈{1,…,E}. Conditional on |Wlro|, the signs of \(W_{j}^{\operatorname {lro}}\) for all null j corresponding to a true \({H_{j}^{0}}\) (1) are i.i.d. coin flips.

Intuitively, if the null \({H_{j}^{e}}\) (1) is not true for e = 0 and the external environments are similar to the target one, the value of \(\smash {W^{\operatorname {ext}}_{j}}\) tends to be larger than that of \(\smash {{W^{0}_{j}}}\). This increases power, especially if the combined sample sizes for the other environments are larger than that of the target one. By contrast, if Xj is null in the sense of \({H_{j}^{e}}\) (1) for e = 0, the sign of \(W^{\operatorname {lro}}_{j}\) is still a fair coin flip independent of everything else (the statistics computed on data from different environments are mutually independent), which allows rigorous FDR control. The value of 𝜃 in (6) must be specified before looking at the data, but unfortunately the optimal choice that maximizes power depends on the data. For example, one should intuitively choose a larger 𝜃 if the external environments have large sample sizes and are similar to the target one, while smaller values of 𝜃 may otherwise be preferable. This dilemma motivates the following alternative approach.

2.3 Transfer Learning with the Adaptive Knockoff Filter

The work of Ren and Candès (2022) developed an extension of the knockoff filter that can be directly applied to address our transfer learning problem. The advantage of this adaptive knockoff filter is that it can leverage the external statistics \(W_{j}^{\operatorname {ext}}\), and any other relevant prior information, in a more flexible and data-driven manner, possibly yielding higher power without involving the particularly sensitive and unknown parameter 𝜃 required by the linear re-ordering method of Section 2.2. In particular, Ren and Candès (2022) prove that the procedure in Eq. 4 controls the FDR even if the statistics are re-arranged in some other data-dependent order π, as long as π satisfies a sign-invariance property that generally allows much more flexibility compared to the original knockoff filter (3). Informally, their approach dynamically learns at each step k of Eq. 4 a new data-dependent ordering πk, combining the prior information contained in \(|W^{0}|\) and \(W^{\operatorname {ext}}\) with the additional knowledge of the signs of \(W^{0}_{\pi _{j}}\) for all j ≤ k. This solution can adaptively adjust the weight given to the external information, relative to that given to the internal statistics, based on preliminary estimates of the numbers of discoveries that may be achievable on the available data set. As a result, if the side information is relevant, their procedure tends to relocate more of the positive statistics at the end of the testing sequence, thereby increasing the number of rejections.

While the adaptive knockoff filter is in principle very flexible, we highlight here a particular implementation that intuitively extends the linear combination approach outlined in Eqs. 5–6. In fact, the aforementioned method from the previous section is equivalent to applying the procedure in Eq. 4 with \(\pi _{k + 1} = \arg \min \limits _{j \notin \{\pi _{1}, \dots , \pi _{k}\}} \{ (1-\theta ) |{W_{j}^{0}}| + \theta |W_{j}^{\operatorname {ext}}| \}\), for a parameter 𝜃 fixed a priori. The adaptive knockoff filter generalizes this solution as it allows tuning a different value of 𝜃 at each step k. For example, consider the following logistic model for the unknown signs of the test statistics:

$$ \operatorname{logit} \mathbb{P}[ \operatorname{sign}({W_{j}^{0}}) = -1 ] = \theta_{1} |{W_{j}^{0}}| + \theta_{2} |W_{j}^{\operatorname{ext}}|, $$

for some parameters 𝜃1,𝜃2. The adaptive knockoff filter fits this model on the data in \(\smash { \{ W^{\operatorname {ext}}, |W^{0}| \} }\) and \(\smash { \{ W^{0}_{\pi _{j}}: j \leq k \} }\), yielding updated estimates \(\hat {\theta }_{1}\) and \(\hat {\theta }_{2}\) at each step k. Then, the (k + 1)-th hypothesis tested by Eq. 4 is the one deemed most likely to be negative; i.e., \(\pi _{k + 1} = \arg \max \limits _{j \notin \{\pi _{1}, \dots , \pi _{k}\}} \big (\hat {\theta }_{1}|{W_{j}^{0}}| + \hat {\theta }_{2} |W_{j}^{\operatorname {ext}}|\big )\). Consequently, the positive statistics tend to be pushed towards the end of the sequence, increasing power while maintaining FDR control. Of course, in general there is no particular reason for the adaptive knockoff filter to rely on this simple logistic model; instead, any model can be exploited at the k-th step to predict the remaining unknown statistics signs from the data in \(\{ W^{\operatorname {ext}}, |W^{0}| \} \cup \smash { \{ W^{0}_{\pi _{j}}: j \leq k \} }\) and any other relevant prior knowledge.
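A simplified sketch of this logistic variant follows (numpy and scikit-learn assumed). The helper name, the greedy reveal-one-at-a-time loop, and the fallback rule used before both sign classes have been revealed are our own simplifications of the adaptive filter of Ren and Candès (2022), not its full implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptive_order(W0, W_ext):
    # Greedily build an ordering pi for Eq. (4): at each step, reveal the
    # hypothesis predicted most likely to have a negative statistic, using
    # a logistic model of sign(W0_j) on (|W0_j|, |W_ext_j|) refitted on the
    # signs revealed so far.
    W0, W_ext = np.asarray(W0, float), np.asarray(W_ext, float)
    feats = np.column_stack([np.abs(W0), np.abs(W_ext)])
    revealed, remaining = [], list(range(len(W0)))
    while remaining:
        y = (W0[revealed] < 0).astype(int)
        if len(set(y)) < 2:
            # too few revealed signs to fit: reveal smallest magnitudes first
            scores = -feats[remaining].sum(axis=1)
        else:
            clf = LogisticRegression().fit(feats[revealed], y)
            scores = clf.predict_proba(feats[remaining])[:, 1]
        j = remaining[int(np.argmax(scores))]
        revealed.append(j)
        remaining.remove(j)
    return revealed  # pi_1, ..., pi_p for Eq. (4)
```

The resulting ordering can then be plugged into the sequential formulation of Eq. 4 in place of the default magnitude-based ordering.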

2.4 Transfer Learning with Prior-Informed Knockoff Statistics

The transfer learning approaches described in Sections 2.2–2.3 are based on statistics whose signs are determined entirely by the observations in the target environment, independently of all external data. Indeed, the only difference between these two methods is the order in which they filter the statistics (4). However, the flexibility of the knockoffs framework allows the external data to also inform the signs of the test statistics, possibly further increasing power. In fact, it is well-known that any available prior information can be directly incorporated into the predictive model of \(\mathbf {Y} \mid \mathbf {X}, \tilde {\mathbf {X}}\) utilized to compute the test statistics input to the standard knockoff filter (Candès et al. 2018), since any model can be employed for this purpose. However, it is unclear how to best take advantage of such extreme flexibility in order to maximize power. Below, we present a concrete implementation of this procedure which intuitively generalizes the sparse generalized linear model (lasso) statistics reviewed in Section 2.1 and often performs well in practice; this approach takes inspiration from the recent work of Li et al. (2021), which studied the related problem of testing for associations that are consistent across many environments.

Consider fitting a sparse generalized linear regression model of \(\mathbf {Y}^{0} \mid \mathbf {X}^{0}, \tilde {\mathbf {X}}^{0}\) with feature-specific \(\ell _{1}\) regularization parameters λj > 0 defined for j ∈{1,…,2p} as:

$$ \lambda_{j} = (1-\gamma) \lambda + \gamma \phi_{j}. $$

Above, λ > 0 and γ ∈ (0,1) are hyper-parameters tuned by cross-validation, and ϕj > 0 is a symmetric inverse measure of the prior importance of Xj or \(\tilde {X}_{j}\) satisfying ϕj = ϕj+p for all j ∈{1,…,p}. In particular, ϕj should take smaller values for more promising variables; more details about how it can be computed will be provided below. The estimated regression coefficients for the original variables and knockoffs are then combined pairwise as in Section 2.1 to obtain the final “weighted-lasso” knockoff statistics, to which we refer as \(W_{j}^{\operatorname {wl}}\). In the special case of γ = 0, or ϕj = 1 for all j, this reduces to the standard solution that does not leverage any external information. As γ is tuned to maximize the predictive accuracy of the model within the target environment, this parameter should be close to zero if the external information is not helpful; by contrast, larger values of γ will tend to be selected if the data from the other environments are relevant. In the latter case, we expect truly important variables to receive weaker regularization and thus be more likely to contribute actively to the sparse model, thereby reducing the noise level and possibly resulting in more powerful statistics. The symmetry requirement that ϕj = ϕj+p for all j ∈{1,…,p} ensures the final statistics satisfy the usual flip-sign property (Candès et al. 2018) necessary to guarantee FDR control with the knockoff filter.
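A sketch of these weighted-lasso statistics (numpy and scikit-learn assumed; helper name ours) uses a standard column-rescaling trick: penalizing coefficient \(b_j\) with weight \(\lambda_j\) is equivalent to running a plain lasso at level λ on columns scaled by \(\lambda/\lambda_j\) and rescaling the fitted coefficients back.

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso_stats(X, X_tilde, y, phi, lam=0.01, gamma=0.5):
    # Lasso with feature-specific penalties lambda_j = (1-gamma)*lam + gamma*phi_j.
    # phi has length p; tiling it enforces the symmetry phi_j = phi_{j+p}.
    p = X.shape[1]
    Z = np.hstack([X, X_tilde])
    lam_j = (1 - gamma) * lam + gamma * np.tile(np.asarray(phi, float), 2)
    scale = lam / lam_j                      # rescale column j by lam/lambda_j
    c = Lasso(alpha=lam).fit(Z * scale, y).coef_
    b = c * scale                            # coefficients on the original scale
    return np.abs(b[:p]) - np.abs(b[p:])     # W^wl_j = |b_j| - |b_{j+p}|
```

Here λ and γ are fixed for brevity; in practice they would be tuned by cross-validation as described above, and γ = 0 recovers the standard unweighted statistics.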

Proposition 2.2.

Let Wwl be the knockoff statistics for the target environment e = 0 computed with the weighted lasso method described above, based on prior importance weights ϕ computed on data from the external environments e ∈{1,…,E}. Conditional on |Wwl|, the signs of \(W_{j}^{\operatorname {wl}}\) for all null j corresponding to a true \({H_{j}^{0}}\) (1) are i.i.d. coin flips.

The above FDR guarantee allows the prior weights ϕj to be completely arbitrary, as long as they depend only on the external data from other environments or on other prior knowledge. As a concrete example, we consider \(\smash { \phi _{j} = 1 / (0.05 + |\hat {b}^{\operatorname {ext}}_{j}| + |\hat {b}^{\operatorname {ext}}_{j+p}| ) }\) for j ∈{1,…,p}, where \(\smash { \hat {b}^{\operatorname {ext}}_{j} }\) and \(\smash { \hat {b}^{\operatorname {ext}}_{j+p}} \) are the estimated coefficients of the (scaled) variables Xj and \(\tilde {X}_{j}\), respectively, in a sparse regression model fitted on the external data. Of course, there is no true need here to analyze the knockoffs from the other environments because the original variables already contain all possibly relevant knowledge. Therefore, an equally valid and simpler alternative would be to compute the above ϕj without including \(\smash { |\hat {b}^{\operatorname {ext}}_{j+p}| }\). Nonetheless, we adhere to the current choice in this paper for two reasons: it is convenient if a previous knockoff analysis was carried out on the data from the external environments, in which case one can simply recycle the already fitted regression model, and it is particularly useful for introducing an interesting variation of this method discussed below, in which the explicit inclusion of the knockoffs becomes important. In the special case where the sets of nulls in the sense of \({H_{j}^{e}}\) (1) are the same for all environments e ∈{0,…,E}, even more flexibility is allowed in the construction of the prior weights ϕ.
For example, consider setting \(\smash { \phi _{j} = 1 / (0.05 + |\hat {b}^{\operatorname {pool}}_{j}| + |\hat {b}^{\operatorname {pool}}_{j+p}| ) }\), for j ∈{1,…,p}, where \(\smash { \hat {b}^{\operatorname {pool}}_{j} }\) and \(\smash { \hat {b}^{\operatorname {pool}}_{j+p}} \) are the estimated coefficients of the variables Xj and \(\tilde {X}_{j}\), respectively, in a sparse regression model fitted on the pooled data obtained by combining the external sets (e ∈{1,…,E}) with the observations from the target environment (e = 0). Then, imagine computing test statistics Wwl by applying the weighted lasso method described above to this ϕ. If the null variables are the same in all environments, the result of Proposition 2.2 still holds even though ϕ is not independent of the data in the target environment, as established by the next proposition. The intuition behind this result is that the proof of Proposition 2.2 can be suitably modified by randomly swapping null variables with their corresponding knockoffs simultaneously in all environments instead of acting only within the target one; this leaves the joint distribution of \([\mathbf {X},\tilde {\mathbf {X}}, \mathbf {Y}]\) invariant, as required by the knockoff filter to ensure FDR control (Candès et al. 2018), if and only if the variables which are null for e = 0 are also null for all other e ∈{1,…,E}.
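The pooled prior weights above can be sketched as follows; this is an illustrative helper of our own (function name, scikit-learn solver, and the fixed lasso penalty are assumptions), fitting one sparse regression on the stacked data from all environments and forming ϕ from the absolute coefficients of each variable and its knockoff.

```python
import numpy as np
from sklearn.linear_model import Lasso

def pooled_prior_weights(X_envs, Xt_envs, y_envs, eps=0.05, alpha=0.05):
    """Prior weights phi_j = 1 / (eps + |b_pool_j| + |b_pool_{j+p}|), where
    b_pool is a sparse regression fit of the pooled outcomes on the pooled
    augmented design [X, X_tilde] from all environments."""
    design = np.vstack([np.hstack([Xe, Xte])
                        for Xe, Xte in zip(X_envs, Xt_envs)])
    y = np.concatenate(y_envs)
    b = Lasso(alpha=alpha, fit_intercept=False).fit(design, y).coef_
    p = design.shape[1] // 2
    phi_half = 1.0 / (eps + np.abs(b[:p]) + np.abs(b[p:]))
    # Symmetrize so that phi_j = phi_{j+p}, as required for FDR control.
    return np.concatenate([phi_half, phi_half])
```

Note that ϕj is bounded above by 1/0.05 = 20, attained by variables with zero estimated coefficients, so unpromising variables receive the strongest regularization.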

Proposition 2.3.

Let Wwl be the knockoff statistics for the target environment e = 0 computed with the weighted lasso method described above, based on prior importance weights ϕ computed on the pooled data set obtained by combining the observations from all environments e ∈{0,1,…,E}. Assume all variables which are null in the target environment are also null in all other environments e ∈{1,…,E}. Then, conditional on |Wwl|, the signs of \(W_{j}^{\operatorname {wl}}\) for all null j corresponding to a true \({H_{j}^{0}}\) (1) are i.i.d. coin flips.
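Propositions 2.2 and 2.3 guarantee FDR control once the statistics are passed through the knockoff filter. For reference, the filter's selection rule is the standard one of Barber and Candès (2015); the code below is our own minimal sketch of that well-known formula.

```python
import numpy as np

def knockoff_threshold(W, q=0.10, offset=1):
    """Selection threshold of the knockoff filter (Barber and Candes, 2015):
    the smallest t > 0 such that
        (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    offset=1 ("knockoff+") provably controls the FDR at level q; offset=0 is
    the more liberal variant used in Section 4, which tends to be more
    powerful but lacks the guarantee when the number of discoveries is small."""
    for t in np.sort(np.abs(W[W != 0])):
        ratio = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if ratio <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

# Variables with W_j >= knockoff_threshold(W, q) are reported as discoveries.
```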

3 Numerical Experiments

A software implementation of our methods is available online at https://github.com/lsn235711/transfer_knockoffs_code, along with code to reproduce the analyses.

3.1 Synthetic Data

We begin by comparing empirically the performances of the different approaches to transfer learning from Sections 2.2–2.4 on synthetic data. Here, the adaptive knockoff filter is run with the “gam” filter (Ren and Candès, 2022), while the weighted-lasso knockoff statistics are obtained with \(\smash { \phi _{j} = 1 / (0.05 + |\hat {b}^{\operatorname {pool}}_{j}| + |\hat {b}^{\operatorname {pool}}_{j+p}| ) }\), for j ∈{1,…,p}, where \(\smash { \hat {b}^{\operatorname {pool}}_{j} }\) and \(\smash { \hat {b}^{\operatorname {pool}}_{j+p}} \) are defined as in Section 2.4. As it is not clear a priori how to best tune the parameter 𝜃 needed by the linearly re-ordered knockoff statistics, we utilize an imaginary oracle to select the value of 𝜃 yielding the largest number of discoveries in each experiment. Of course, this is not guaranteed to control the FDR in theory and hence may not be a valid approach in practice, but it provides an informative comparison with the other two methods within the scope of these simulations. As benchmarks, we consider the following two approaches: (i) the vanilla knockoffs analysis (Candès et al. 2018) applied only to the data from the target environment; and (ii) the pooling heuristic in which the standard knockoffs methodology is applied to the pooled data from all environments. Both benchmarks are applied using the standard lasso-based statistics reviewed in Section 2.1, as in Candès et al. (2018). Simulated data are obtained from 3 different environments, each consisting of p = 500 variables and n = 800 observations. In all environments, the variables are generated from an autoregressive model of order one with correlation parameter ρ = 0.5, and the knockoffs are constructed based on the true \({P_{X}^{e}}\) with the standard algorithm from Candès et al. (2018).
The conditional distribution of YeXe in the e-th environment is given by Ye = Xeβe + 𝜖e, where \(\beta ^{e} \in \mathbb {R}^{p}\) is an environment-specific effect parameter vector, and 𝜖e are i.i.d. standard Gaussian noise. Note that here the model generating the outcome Y varies across environments. In each environment, 60 entries of βe are equal to \(a/\sqrt {n}\), with a = 3.5, while the others are zero. The sets of non-zero entries of β in each environment, \(S^{0}, S^{1}, S^{2} \subseteq \{1,\ldots ,p\}\) respectively, are chosen at random such that S1 = S2, while the overlap (the proportion of shared elements) between S0 and S1 is varied as a control parameter. Our goal is to find which variables are non-null in the first environment (i.e., to discover S0), controlling the false discovery rate below 10%. All experiments are repeated 500 times, averaging the empirical false discovery proportion and power.
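The data-generating process above can be sketched as follows; the helper names are ours, and this simplified illustration omits the knockoff construction step.

```python
import numpy as np

def overlapping_supports(rng, p=500, k=60, overlap=0.5):
    """Draw supports S0 and S1 of size k sharing a fraction `overlap` of
    their elements, mimicking the design of Section 3.1 (where S1 = S2)."""
    n_shared = int(round(overlap * k))
    shared = rng.choice(p, size=n_shared, replace=False)
    rest = np.setdiff1d(np.arange(p), shared)
    extra = rng.choice(rest, size=2 * (k - n_shared), replace=False)
    S0 = np.sort(np.concatenate([shared, extra[: k - n_shared]]))
    S1 = np.sort(np.concatenate([shared, extra[k - n_shared:]]))
    return S0, S1

def sample_environment(rng, support, n=800, p=500, rho=0.5, a=3.5):
    """One environment: AR(1) covariates with correlation rho, and outcome
    Y = X beta + eps with beta_j = a / sqrt(n) for j in `support`."""
    # AR(1): X_1 ~ N(0,1), X_j = rho * X_{j-1} + sqrt(1 - rho^2) * Z_j
    X = np.empty((n, p))
    X[:, 0] = rng.normal(size=n)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.normal(size=n)
    beta = np.zeros(p)
    beta[support] = a / np.sqrt(n)
    y = X @ beta + rng.normal(size=n)  # i.i.d. standard Gaussian noise
    return X, y
```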

Figure 1 compares the performance of the three transfer learning methods and of the two benchmarks as a function of the overlap between S0 and S1. The results show that transfer learning helps increase power compared to the vanilla method of Candès et al. (2018) if the external environments are sufficiently similar to the target one, while always controlling the FDR, both when this control is predicted by the theory and when adopting the oracle method that lacks rigorous guarantees. In the case of the adaptive knockoff filter and of the weighted-lasso statistics, transfer learning does not seem to hurt power even if the target environment is completely different from the others (zero overlap). In contrast, the method with linearly re-ordered statistics suffers from lower power if the target environment is very different. The power loss is more apparent if 𝜃 is fixed (see Fig. 4 in Appendix B). Unsurprisingly, pooling does not control the false discovery rate, as it tends to report any variables that are non-null in at least one environment, not necessarily the target one. Finally, Fig. 4 in Appendix B shows the performance of knockoffs with linearly re-ordered knockoff statistics applied with different fixed choices of 𝜃. These results suggest that larger values of 𝜃 make the power more sensitive to the overlap between S0 and the other environments: if the overlap is high, the power is significantly larger than that of the vanilla knockoffs, but if the overlap is low, the power becomes much lower.

Figure 1: Performance of different transfer learning methods on simulated data, compared to two benchmarks. Each point averages the results of 500 independent experiments.

3.2 Real Genotype Data and Simulated Phenotypes

The performances of all methods and benchmarks from the previous section are now compared on simulated but realistic GWAS data. For this purpose, we utilize the genetic data for the UK Biobank (Bycroft et al. 2018) samples with self-reported ancestry in one of the following five distinct populations: African (n = 7,635), Asian (n = 3,284), British (n = 429,934), non-British European (n = 28,994), and Indian (n = 7,628), for a total of 477,475 individuals, as in Li et al. (2021). For each of these individuals, we focus on the variants from chromosome one, disregarding the other chromosomes for simplicity. Following the approach of Sesia et al. (2021), we only analyze biallelic single nucleotide polymorphisms (SNPs) with minor allele frequency above 0.1% and in Hardy-Weinberg equilibrium (p-value above 10− 6) among the subset of 350,119 unrelated British individuals previously analyzed in Sesia et al. (2020). The genotyped variants are partitioned into contiguous blocks in high linkage disequilibrium, so that their median width is 208 kb. These blocks are obtained by applying complete-linkage hierarchical clustering using the genetic distances, which are measured in centimorgan and estimated in a European population (The International HapMap 3 Consortium, 2010); see Sesia et al. (2021) for additional details. As in previous work on knockoffs for GWAS (Sesia et al. 2020), our goal is to discover which of the above SNP blocks are likely to contain distinct association signals, focusing of course on the analysis of the data from the target environment. These SNP blocks act as the fundamental units of inference in our analysis, so that the conditional hypotheses Hj defined in Eq. 1 are effectively replaced with the slightly more general

$$ H_{G}: Y {\perp\!\!\!\perp} X_{G} \mid X_{-G}, \qquad (7) $$

where \(G \subseteq \{1,\ldots ,p\}\) indicates a block of SNPs, leaving all other aspects of the analysis unchanged compared to Section 2. The advantage of this approach is that it allows us to control the trade-off between power and resolution: the conditional hypotheses corresponding to smaller SNP blocks are naturally more informative, as they localize the significant genetic associations more precisely, but they are also inevitably more difficult to reject (Sesia et al. 2020). In practice, we utilize in these experiments the same groups of SNPs and the corresponding knockoffs at the 208 kb resolution as in Sesia et al. (2021). The true model for the phenotype is determined by randomly picking 100 SNPs as the “causal variants”, ensuring that they all come from distinct SNP blocks, and varying as a control parameter the heterogeneity across populations of their minor allele frequencies. When the heterogeneity parameter is 0%, all causal variants have approximately equal minor allele frequencies in all five populations. When the heterogeneity parameter is 100%, each of the five populations is assigned 20 specific causal variants among those with the highest frequency in that population and the lowest possible frequency in all others, consistent with the constraint that each block should contain at most one causal variant. In the intermediate cases in which the heterogeneity parameter is between 0% and 100%, a corresponding fraction of causal variants are chosen with the first method above while the remaining ones are chosen with the second method, interpolating between those two extremes. Note that all 100 causal variants may be present in all populations, although with different frequencies, and have a causal effect in all of them. The causal effect sizes are, however, population-specific.
In particular, for each population, the effect sizes of all 100 causal variants are independent and identically distributed within the interval [0.1,10]. The signs of the causal effects are independent coin flips but remain constant across populations. Conditional on the genotypes and on the above causal effects, the synthetic phenotypes are generated from a linear model with homoscedastic Gaussian noise: \(Y^{e} \sim \mathcal {N}(X^{e}\beta ^{e}, \sigma ^{2})\). Above, Xe and Ye indicate the genotypes and phenotypes in the e-th population, respectively, while βe is the vector containing the signed effect sizes for all 100 causal variants. The noise variance σ2 is fixed so that the signal-to-noise ratio is 5% or 10%, depending on the setting. The goal of this analysis is to localize the 100 blocks of SNPs containing causal variants controlling the FDR below 10%.
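The phenotype simulation above can be sketched as follows. The helper name is ours, and our reading of "signal-to-noise ratio" as Var(Xβ)/σ² is an assumption, since the paper does not spell out the exact definition.

```python
import numpy as np

def simulate_phenotype(rng, X, beta, snr=0.10):
    """Generate Y ~ N(X beta, sigma^2 I), with sigma^2 chosen so that
    Var(X beta) / sigma^2 equals the target signal-to-noise ratio
    (our assumed definition of SNR)."""
    signal = X @ beta
    sigma2 = np.var(signal) / snr   # fix the noise level from the SNR
    return signal + rng.normal(scale=np.sqrt(sigma2), size=X.shape[0])
```

Under this definition, a 10% SNR means the noise variance is ten times the variance of the genetic signal, so the total phenotypic variance is roughly eleven times the signal variance.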

Figure 2 compares the performances of all methods applied to the above data as a function of the heterogeneity of the causal variants, in the case of a 10% signal-to-noise ratio. Each method is applied to the data from the population labeled in the corresponding column, while the transfer learning prior information is obtained by applying knockoffs to the pooled data set of all UK Biobank individuals; note that this is a valid solution because the support of the causal model is constant across populations. The results demonstrate that all transfer learning methods control the FDR, as predicted by the theory, and that the weighted lasso approach is the most powerful. Note that the sample sizes for the four minority groups differ meaningfully, which largely explains the variation in the performances of all methods across populations. However, these groups also differ in their genetic similarity to the British population, and hence in the amount of transferable information, which means that one should not expect transfer learning to perform equally well even if the sample sizes were all the same. Indeed, Fig. 5 in Appendix B reports analogous results obtained from the analysis of smaller data sets with equal sample sizes across populations (n = 3,284, the number of individuals belonging to the Asian population, which is the smallest environment in this data set), showing that transfer learning from the British population tends to be most effective when the target population is the European one, especially if the heterogeneity of the causal allele frequencies is high.

Figure 2: Performances of different knockoff methods for transfer learning applied to simulated GWAS data with real genotypes from different populations. The nominal FDR level is 10%. The empirical FDR and power are averaged over 10 experiments with independent phenotypes. The last column summarizes the results of a standard knockoff analysis of the pooled data from all populations.

Figure 3 compares the performances of the different transfer learning methods evaluated in a population-specific sense, when the signal-to-noise ratio is 5%. More precisely, we imagine assigning each variant, whether causal or not, to the population in which it displays the largest minor allele frequency. Then, the FDR and power of our analysis are evaluated separately in each population by counting only the specific discoveries and true causal variants assigned to it. The intuition is that these are the specific discoveries that will likely be most useful for building effective predictive models for the population of interest. Unfortunately, controlling the false discovery rate becomes difficult in this setting, as some of the reported associations are discarded post-hoc, because one generally has no guarantee that the expected spurious findings are not disproportionately concentrated among the selected ones (Katsevich et al. 2021). Although the transfer learning methods described in this paper are not the only possible approach to mitigate this problem, and in truth they do not even explicitly solve the post-hoc filtering challenge explained above, we will demonstrate that they can be useful to highlight discoveries that are significant both statistically and practically within the target environment. The results show that all transfer learning methods considered here always control the population-specific FDR in practice, although they are only theoretically guaranteed to control the global FDR over all reported variants. By contrast, the vanilla knockoffs method applied to the British samples alone does not always empirically control the population-specific FDR in other populations, even though one may have intuitively expected it to be valid also in this sense since the causal model is by design the same regardless of ancestry. 
However, the issue with the vanilla single-population analysis here is that the British samples contain relatively little information about the variants which are specific to other groups, and so the knockoff filter is more likely to make mistakes on them. Figure 6 in Appendix B shows similar results in the case of a signal-to-noise ratio equal to 10%; the population-specific FDR violation of the vanilla approach is still noticeable there but it is smaller and not statistically significant. Finally, Fig. 7 shows that pooling appears to empirically control the population-specific FDR in the setting of Fig. 3, even though the vanilla approach applied to British samples did not.
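The population-specific evaluation described above can be sketched with a small helper of our own (the function and variable names are illustrative, not the paper's code): each variant is assigned to the population where its minor allele frequency is largest, and the FDP and power are then computed over that population's variants only.

```python
def population_specific_metrics(selected, causal, assigned_pop, pop):
    """Population-specific false discovery proportion and power.

    Only discoveries and causal variants assigned to `pop` are counted,
    where `assigned_pop[j]` is the population in which variant j has its
    highest minor allele frequency (Section 3.2)."""
    sel = {j for j in selected if assigned_pop[j] == pop}
    caus = {j for j in causal if assigned_pop[j] == pop}
    true_pos = sel & caus
    fdp = (len(sel) - len(true_pos)) / max(1, len(sel))
    power = len(true_pos) / max(1, len(caus))
    return fdp, power
```

As the text notes, averaging this FDP over repeated experiments is not guaranteed to stay below the nominal level, because the post-hoc restriction to one population can concentrate the spurious findings.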

Figure 3: Performances of different knockoff methods for transfer learning applied to simulated GWAS data, as in Fig. 2. The FDR and power are population-specific: they only count discoveries whose minor allele frequency is highest in the population of interest.

4 Real Data Analysis

We apply the different transfer learning methods discussed in this paper to study several real phenotypes in the UK Biobank resource, using the same data and knockoffs as in Sesia et al. (2021). Following the approach of Section 3.2, we consider five environments based on the self-reported ancestry of each individual, separately utilizing as target environments each of the four minority populations: African, Asian, non-British European, and Indian. In each case, the test statistics computed on the British samples are used as prior information. In particular, we consider two types of prior information: single-phenotype prior information and multi-phenotype prior information. The former consists of the importance statistics obtained from the analysis of the phenotype of interest in the British population, while the latter includes also the analogous information obtained from the analysis of other related phenotypes with potentially similar genetic architectures. All data are pre-processed based on the protocol of Sesia et al. (2021), which consists of filtering SNPs and individuals based on standard quality-control criteria, clustering the genotypes into hierarchical blocks at different resolutions (with median widths equal to 3, 20, 41, 81, 208, 425 kb, respectively) following the same approach described in Section 3.2, and generating the corresponding knockoffs at each resolution. For all methods, the knockoff filter (SeqStep) of Barber and Candès (2015) is applied separately at each resolution (Sesia et al. 2020). The offset parameter of this filter is set equal to 0, which tends to be more powerful than the standard implementation but does not theoretically control the FDR if the number of discoveries is very small.

4.1 Leveraging Single-Phenotype Prior Information

We analyze here three phenotypes: platelet count, standing height, and body mass index (BMI). For each phenotype, we make use of the feature importance statistic W corresponding to the same phenotype obtained from the British population. The three transfer learning methods introduced in Section 2 are implemented and compared with the results of the vanilla knockoffs procedure. Table 1 summarizes the numbers of discoveries for platelet count. The results show that transfer learning yields more discoveries than vanilla knockoffs, confirming the advantage of leveraging prior information. The knockoffs procedure with weighted lasso statistics leads to the most discoveries, demonstrating the benefit of refitting the predictive model. Analogous results, although with fewer discoveries, are shown in Tables 3 and 4 (Appendix C) for height and BMI, respectively. The full lists of discoveries are available online from https://msesia.github.io/knockoffgwas/ukbiobank.

Table 1 Numbers of discoveries for platelet count reported by vanilla knockoffs and three transfer learning methods applied to data from different minority populations in the UK Biobank, leveraging prior information from a much larger number of British samples. The discoveries are reported separately at six different levels of resolution. The nominal FDR level is 10% at each resolution

4.2 Leveraging Multi-phenotype Prior Information

We now consider four related phenotypes that may share some similarities in their genetic architectures: BMI, systolic blood pressure (SBP), diabetes, and cardiovascular disease (CVD). These four phenotypes are separately studied using the British samples, and the results of all those analyses are utilized jointly as prior information for all specific studies in the minority populations. Concretely, before studying any phenotype in the minority populations, we first compute on the data from the British samples the 2p-dimensional estimated coefficients \(\hat {b}^{\text {diabetes}}, \hat {b}^{\text {BMI}},\hat {b}^{\text {CVD}}, \hat {b}^{\text {SBP}}\), and the corresponding knockoff test statistics Wdiabetes, WBMI, WCVD, WSBP, for diabetes, BMI, CVD, and SBP, respectively. Then, the adaptive knockoffs method is simply applied to each minority study with the four-dimensional vector \((W_{j}^{\text {diabetes}},\) \(W_{j}^{\text {BMI}}, W_{j}^{\text {CVD}}, W_{j}^{\text {SBP}})\) as prior information input (Ren and Candès, 2022) for each SNP j ∈{1,…,p}, utilizing the single-source weighted lasso statistics as the test statistic. In the case of transfer learning with weighted-lasso statistics, which requires one-dimensional prior information for each SNP, we combine \((\hat {b}_{j}^{\text {diabetes}}\), \(\hat {b}_{j}^{\text {BMI}}\), \(\hat {b}_{j}^{\text {CVD}}, \hat {b}_{j}^{\text {SBP}})\) by taking the mean and then proceed as in Section 2.4; we henceforth refer to this method as the knockoffs procedure with multi-weighted lasso statistics. Table 2 reports the number of findings for BMI obtained from the analysis of the non-British European samples. These results show that transfer learning increases power, consistent with the numerical experiments in Section 3. Tables 6, 7, and 8 in Appendix B summarize the analogous numbers of discoveries corresponding to the other minority populations and phenotypes.
Again, one can see that transfer learning tends to provide some benefits, although for these phenotypes the power of all methods is often very low in all but the European population, likely due to the relatively small sample sizes. In this example, taking the average of the prior information from different sources does not improve the performance of transfer learning with weighted lasso statistics, possibly because it is not optimal to assign an equal weight to each source. The adaptive knockoffs methodology offers a more principled and tuning-free approach for dealing with multi-dimensional prior information, and in our analysis we see improvement in some cases. Since here both the power and the improvement are small, we do not attempt to draw conclusions on the relative performance of the methods, but rather aim to demonstrate the possibility of simultaneously leveraging prior information from multiple sources which differ both in their underlying populations and in the phenotypes studied.
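The multi-weighted lasso prior can be sketched as follows. This helper is ours; in particular, averaging the absolute coefficients before forming ϕ is our reading of "taking the mean" in Section 4.2 (the paper's exact convention may differ, e.g., averaging signed coefficients).

```python
import numpy as np

def multi_weighted_prior(b_list, eps=0.05):
    """Combine 2p-dimensional coefficient vectors from several source
    phenotypes (e.g., diabetes, BMI, CVD, SBP fitted on British samples)
    into one prior weight vector: average the absolute coefficients across
    sources, then form phi as in Section 2.4."""
    b_avg = np.mean([np.abs(b) for b in b_list], axis=0)
    p = b_avg.shape[0] // 2
    phi_half = 1.0 / (eps + b_avg[:p] + b_avg[p:])
    return np.concatenate([phi_half, phi_half])  # enforce phi_j = phi_{j+p}
```

One could replace the unweighted mean with source-specific weights, which the text suggests might improve on the equal-weight choice evaluated here.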

Table 2 Numbers of discoveries for BMI, using UK Biobank data from non-British European samples and leveraging prior knowledge on BMI, systolic blood pressure, diabetes, and cardiovascular disease acquired from the analysis of British samples. The discoveries are reported at six different levels of resolution. The nominal FDR level is 10% at each resolution

5 Discussion

This paper demonstrated that incorporating relevant knowledge from external data sets can significantly improve the power of conditional testing with knockoffs, especially if the available prior information is carefully leveraged to fit a more accurate predictive model for the computation of the test statistics. The empirical results obtained on synthetic and real data suggest that these methods can be useful for the analysis of GWAS data, allowing one to increase the number of discoveries by borrowing strength from possibly larger studies which may either involve populations with different ancestries or focus on different phenotypes. Transfer learning is indeed particularly appealing in such cases because the naive alternative of pooling all the data would not be a satisfactory solution, either because it may skew our findings towards those which are most relevant for the most common population, or because it would not lead to the discovery of the variables that are truly important for the outcome of interest. An intriguing direction for future research would involve investigating possible connections between the transfer learning for conditional testing studied in this paper, and the more traditional task of predictive transfer learning, with the ultimate goal of computing more accurate estimates of genetic risk for populations which have been historically underrepresented in GWAS data. In fact, one may intuitively expect that predictive models based on variables selected with the methods in this paper may perform better than those obtained with alternative variable selection approaches that either do not leverage relevant information gained by analyzing data from different populations, or do not correctly account for genetic diversity across populations.

Regarding future applications to GWAS data, it is possible to apply the transfer learning methods in Section 4.2 to leverage prior information acquired from the study of related phenotypes within the same population as the target one (e.g., the British population in the UK Biobank). One must, though, be careful to avoid utilizing data from the same individuals twice, as our theoretical results assume the prior information to be independent of the test statistics computed on the target data set. Since the sampling randomness upon which knockoff inferences are based arises from the distribution of the genotypes, not that of the phenotypes, our transfer learning methods would not be guaranteed to control the FDR if the prior depended on observations involving individuals also present in the target data set, regardless of the relation between the different phenotypes. However, it would be correct to utilize two different subsets of British samples to acquire potentially useful prior information and to calculate the test statistics for the phenotype of interest. This may be particularly relevant if the phenotype is only measured in relatively few people or, in the case of a binary trait, if there are many more healthy controls than disease cases. Then, transfer learning would be a natural solution to gain strength from data on more common conditions suspected to share some similarities in genetic architecture, while retaining guaranteed FDR control for the study of interest.