Transfer learning for Bayesian discovery of multiple Bayesian networks

Abstract

Bayesian network structure learning algorithms with limited data are being used in domains such as systems biology and neuroscience to gain insight into the underlying processes that produce observed data. Learning reliable networks from limited data is difficult; therefore, transfer learning can improve the robustness of learned networks by leveraging data from related tasks. Existing transfer learning algorithms for Bayesian network structure learning give a single maximum a posteriori estimate of network models. Yet, many other models may be equally likely, and so a more informative result is provided by Bayesian structure discovery. Bayesian structure discovery algorithms estimate posterior probabilities of structural features, such as edges. We present transfer learning for Bayesian structure discovery which allows us to explore the shared and unique structural features among related tasks. Efficient computation requires that our transfer learning objective factors into local calculations, which we prove is given by a broad class of transfer biases. Theoretically, we show the efficiency of our approach. Empirically, we show that compared to single-task learning, transfer learning is better able to positively identify true edges. We apply the method to whole-brain neuroimaging data.

Notes

  1. http://www.cs.unm.edu/~doyen/software.

  2. http://www.gnu.org/software/gsl/.

References

  1. Bailey WN (1935) Generalised hypergeometric series. University Press, Cambridge

  2. Beinlich IA, Suermondt HJ, Chavez RM, Cooper GF (1989) The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In: Second European conference on artificial intelligence in medicine, vol 38, pp 247–256

  3. Björklund A, Husfeldt T, Kaski P, Koivisto M (2007) Fourier meets Möbius: fast subset convolution. In: Proceedings of the thirty-ninth annual ACM symposium on theory of computing, ACM, pp 67–74

  4. Buntine W (1991) Theory refinement on Bayesian networks. In: Seventh conference on uncertainty in artificial intelligence, pp 52–60

  5. Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75

  6. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347

  7. Cooper G, Yoo C (1999) Causal discovery from a mixture of experimental and observational data. In: Conference on uncertainty in artificial intelligence, pp 116–125

  8. Friedman N, Koller D (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1):95–125

  9. Grzegorczyk M, Husmeier D (2008) Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Mach Learn 71(2–3):265–305

  10. Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243

  11. Koivisto M (2006) Advances in exact Bayesian structure discovery in Bayesian networks. In: Twenty-second conference annual conference on uncertainty in artificial intelligence, pp 241–248

  12. Koivisto M, Sood K (2004) Exact Bayesian structure discovery in Bayesian networks. J Mach Learn Res 5:549–573

  13. Lauritzen S, Spiegelhalter D (1988) Local computations with probabilities on graphical structures and their application to expert systems. J R Stat Soc B (Methodological) 50:157–224

  14. Luis R, Sucar LE, Morales EF (2010) Inductive transfer for learning Bayesian networks. Mach Learn 79(1–2):227–255

  15. Madigan D, York J, Allard D (1995) Bayesian graphical models for discrete data. Int Stat Rev 63(2):215–232

  16. Niculescu-Mizil A, Caruana R (2007) Inductive transfer for Bayesian network structure learning. In: Eleventh international conference on artificial intelligence and statistics

  17. Niinimaki T, Parviainen P, Koivisto M (2011) Partial order MCMC for structure discovery in Bayesian networks. In: Twenty-seventh annual conference on uncertainty in artificial intelligence, pp 557–564

  18. Oyen D, Lane T (2012) Leveraging domain knowledge in multitask Bayesian network structure learning. In: Twenty-Sixth AAAI conference on artificial intelligence

  19. Parviainen P, Koivisto M (2009) Exact structure discovery in Bayesian networks with less space. In: Twenty-fifth conference on uncertainty in artificial intelligence, pp 436–443

  20. Pearl J, Bareinboim E (2011) Transportability of causal and statistical relations: a formal approach. In: Twenty-fifth national conference on artificial intelligence, pp 247–254

  21. Richardson M, Domingos P (2003) Learning with knowledge from multiple experts. In: ICML, vol 20, pp 624–631

  22. Thrun S (1996) Is learning the n-th thing any easier than learning the first? Adv Neural Inf Process Syst 8:640–646

  23. Tong S, Koller D (2001) Active learning for structure in Bayesian networks. In: International joint conference on artificial intelligence, vol 17, pp 863–869

  24. Werhli A, Husmeier D (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol 6(1)

  25. Zhang J, Zhang C (2010) Multitask Bregman clustering. In: Twenty-fourth national conference on artificial intelligence

Acknowledgments

Thanks to Vincent P. Clark and the Mind Research Network for neuroimaging data. Also, thanks to Eric Eaton and Paul Ruvolo for useful discussions and feedback. Work funded by the Office of Naval Research grant N00014-000-0000.

Author information

Corresponding author

Correspondence to Diane Oyen.

Appendix

1.1 Extended proof of Theorem 

We derive the equality in Theorem  starting from the joint probability of the data and the feature conditioned on an order, as given in Eq. 9,

$$\begin{aligned} P(f^{(t)}, \mathcal {D} | \! \prec ) = \sum _{G^{(t)} \subseteq \! \prec } f(G^{(t)}) P(D^{(t)} | G^{(t)}) \left( \sum _{G \subseteq \! \prec } P(G^{(t)}, G | \! \prec ) \prod _{k \ne t} P(D^{(k)} | G) \right) . \end{aligned}$$

Modularity properties are used to substitute \(P(D|G) = \prod _{i=1}^n P(x_i | \pi _i)\), \(f(G) = \prod _{i=1}^n f_i(\pi _i)\), and \(P(G^{(t)}, G | \! \prec ) = \prod _{i=1}^n P(\pi _i^{(t)}, \pi _i | U_i)\), as follows:

$$\begin{aligned} P(f^{(t)}, \mathcal {D} | \! \prec )&= \sum _{\pi _1^{(t)} \subseteq U_1} \cdots \sum _{\pi _n^{(t)} \subseteq U_n} \prod _{i=1}^n \left[ f_i(\pi _i^{(t)}) P(x_i^{(t)} | \pi _i^{(t)}) \right] \\&\times \left( \sum _{\pi _1 \subseteq U_1} \cdots \sum _{\pi _n \subseteq U_n} \prod _{i=1}^n P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \right) . \end{aligned}$$

The goal is to factor the terms so that the local calculation at each node \(i\) can be performed independently. First, we factor the terms in the structure bias (the part in round parentheses above) as follows:

$$\begin{aligned}&\sum _{\pi _1 \subseteq U_1} \cdots \sum _{\pi _n \subseteq U_n} \prod _{i=1}^n P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i)\\&\quad = \sum _{\pi _1 \subseteq U_1} \cdots \sum _{\pi _{n-1} \subseteq U_{n-1}} \prod _{i=1}^{n-1} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \\&\quad \quad \times \left( \sum _{\pi _{n} \subseteq U_{n}} P(\pi _n^{(t)}, \pi _n | U_n) \prod _{k \ne t} P(x_n^{(k)} | \pi _n) \right) \\&\quad = \left( \sum _{\pi _1 \subseteq U_1} P(\pi _1^{(t)}, \pi _1 | U_1) \prod _{k \ne t} P(x_1^{(k)} | \pi _1) \right) \\&\quad \quad \times \left( \sum _{\pi _2 \subseteq U_2} P(\pi _2^{(t)}, \pi _2 | U_2) \prod _{k \ne t} P(x_2^{(k)} | \pi _2) \right) \times \cdots \\&\quad \quad \times \left( \sum _{\pi _n \subseteq U_n}P(\pi _n^{(t)}, \pi _n | U_n) \prod _{k \ne t} P(x_n^{(k)} | \pi _n) \right) \\&\quad = \prod _{i=1}^n \sum _{\pi _i \subseteq U_i} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i).\\ \end{aligned}$$

Plugging the factored structure prior back into the full equation, we now have:

$$\begin{aligned}&P(f^{(t)}, \mathcal {D} | \! \prec ) \\&\quad = \sum _{\pi _1^{(t)} \subseteq U_1} \!\cdots \! \sum _{\pi _n^{(t)} \subseteq U_n} \prod _{i=1}^n f_i(\pi _i^{(t)}) P(x_i^{(t)} | \pi _i^{(t)}) \left( \sum _{\pi _i \subseteq U_i} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \right) . \end{aligned}$$

Performing a factoring similar to that done above on the structure prior, we achieve the desired result:

$$\begin{aligned} P(f^{(t)}, \mathcal {D} | \! \prec ) = \prod _{i=1}^n \sum _{\pi _i^{(t)} \subseteq U_i} f_i(\pi _i^{(t)}) P(x_i^{(t)} | \pi _i^{(t)}) \left( \sum _{\pi _i \subseteq U_i} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \right) . \end{aligned}$$
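
This last step relies only on the distributivity of summation over node-local products. The following minimal Python sketch, added here purely as a sanity check (it is not part of the original derivation), verifies numerically on a toy problem that summing the product of per-node terms over all joint choices of parent sets equals the product of the per-node sums; the number of nodes, candidate parent-set counts, and random local scores are illustrative assumptions.

```python
# Sanity check of the factorization: sum over joint parent-set choices of a
# product of node-local terms equals the product of per-node sums.
# Toy sizes and random local scores are illustrative assumptions only.
import itertools
import math
import random

random.seed(0)
n = 4                                       # number of nodes (toy value)
sizes = [3, 2, 4, 3]                        # candidate parent sets per node (toy values)
g = [[random.random() for _ in range(s)] for s in sizes]   # local terms g_i(pi_i)

# Unfactored form: sum over all joint choices of parent sets of the product of local terms
joint = sum(
    math.prod(g[i][c[i]] for i in range(n))
    for c in itertools.product(*(range(s) for s in sizes))
)

# Factored form: product over nodes of the per-node sums
factored = math.prod(sum(gi) for gi in g)

assert abs(joint - factored) < 1e-9
```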

1.2 Normalization constant for structure bias

Calculation of the normalization constant requires summing over all possible combinations of parent sets \((\pi _i^{(t)}, \pi _i^{(k)})\), as follows:

$$\begin{aligned} Z = \sum _{\pi _i^{(t)} \subseteq U_i} \sum _{\pi _i^{(k)} \subseteq U_i} (1-\lambda )^{\Delta _{itk}}. \end{aligned}$$

We can simplify this by first fixing parent set \(\pi _i^{(t)}\) while counting how many parent sets \(\pi _i^{(k)}\) give \(\Delta _{itk} = 0\) (\(\pi _i^{(k)}\) can contain any parents from the set \(\{U_i {\setminus } \pi _i^{(t)}\}\) but none from \(\pi _i^{(t)}\)); then how many give \(\Delta _{itk} = 1\) (\(\pi _i^{(k)}\) can contain any parents from the set \(\{U_i {\setminus } \pi _i^{(t)}\}\) and exactly one from \(\pi _i^{(t)}\)); and so on, up to the maximum of \(\Delta _{itk} = |\pi _i^{(t)}|\). This sum turns out to be a binomial expansion, so we can write it in closed form. The expansion is demonstrated here:

$$\begin{aligned} Z&= \sum _{\pi _i^{(t)} \subseteq U_i} \left[ \sum _{\{\pi _i^{(k)} \subseteq U_i \big | \Delta _{itk} = 0\}}\!\!\!\!\!\!\! 1 + \sum _{\{\pi _i^{(k)} \subseteq U_i \big | \Delta _{itk} = 1\}}\!\!\!\!\!\!\!\!\!\!\!\! (1-\lambda ) + \cdots + \!\!\!\!\!\!\!\!\!\! \sum _{\{\pi _i^{(k)} \subseteq U_i \big | \Delta _{itk} = |\pi _i^{(t)}|\}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! (1-\lambda )^{| \pi _i^{(t)}|} \right] \\&= \sum _{\pi _i^{(t)} \subseteq U_i} \left[ \left( {\begin{array}{c}| \pi _i^{(t)}|\\ 0\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|} + \left( {\begin{array}{c}| \pi _i^{(t)}|\\ 1\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|}(1-\lambda ) + \cdots \right. \\&\left. \quad +\, \left( {\begin{array}{c}| \pi _i^{(t)}|\\ | \pi _i^{(t)}|\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|}(1-\lambda )^{| \pi _i^{(t)}|} \right] \\&= \sum _{\pi _i^{(t)} \subseteq U_i} \sum _{j=0}^{| \pi _i^{(t)}|} \left( {\begin{array}{c}| \pi _i^{(t)}|\\ j\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|} (1-\lambda )^j \\&= \sum _{\pi _i^{(t)} \subseteq U_i} 2^{| U_i | - | \pi _i^{(t)}|} \sum _{j=0}^{| \pi _i^{(t)}|} \left( {\begin{array}{c}| \pi _i^{(t)}|\\ j\end{array}}\right) (1-\lambda )^j \\&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} 2^{- | \pi _i^{(t)}|} (1 + 1 - \lambda )^{| \pi _i^{(t)}|}\\&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} \left( \frac{2-\lambda }{2} \right) ^{| \pi _i^{(t)}|} \\&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} \left( 1-\frac{\lambda }{2}\right) ^{| \pi _i^{(t)}|} . \end{aligned}$$

Next, we perform a similar expansion of the sum over parent sets \(\pi _i^{(t)}\) of size \(|\pi _i^{(t)}| = 0\), \(|\pi _i^{(t)}| = 1\), and so on, up to the maximum \(|\pi _i^{(t)}| = |U_i|\). This sum also turns out to be a binomial expansion and can therefore be simplified into a closed form, as demonstrated:

$$\begin{aligned} Z&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} \left( 1-\frac{\lambda }{2}\right) ^{| \pi _i^{(t)}|}\\&= 2^{| U_i |} \left[ \sum _{\{ \pi _i^{(t)} \subseteq U_i \big | | \pi _i^{(t)} | = 0\}}\!\!\!\!\!\!\! 1 + \sum _{\{ \pi _i^{(t)} \subseteq U_i \big | | \pi _i^{(t)} | = 1\}}\!\!\!\!\!\!\!\!\!\!\!\! \left( 1 - \frac{\lambda }{2} \right) + \cdots + \!\!\!\!\!\!\!\!\!\!\!\!\!\! \sum _{\{ \pi _i^{(t)} \subseteq U_i \big | | \pi _i^{(t)} | = | U_i |\}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \left( 1 - \frac{\lambda }{2} \right) ^{| U_i |} \right] \\&= 2^{| U_i |} \left[ \left( {\begin{array}{c}|U_i|\\ 0\end{array}}\right) + \left( {\begin{array}{c}|U_i|\\ 1\end{array}}\right) \left( 1-\frac{\lambda }{2} \right) + \cdots + \left( {\begin{array}{c}|U_i|\\ |U_i|\end{array}}\right) \left( 1 - \frac{\lambda }{2} \right) ^ {|U_i|} \right] \\&= 2^{| U_i |} \sum _{j=0}^{|U_i|} \left( {\begin{array}{c}|U_i|\\ j\end{array}}\right) \left( 1 - \frac{\lambda }{2}\right) ^j\\&= 2^{| U_i |} \left( 2-\frac{\lambda }{2}\right) ^{| U_i |} \\&= (4 - \lambda )^{| U_i |}. \end{aligned}$$
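
The closed form can also be confirmed by brute force on a small example. The sketch below, not part of the original paper, enumerates all pairs \((\pi _i^{(t)}, \pi _i^{(k)})\) over a toy \(U_i\) and compares the double sum against \((4 - \lambda )^{|U_i|}\). Here \(\Delta _{itk}\) is computed as the number of shared parents, read off the counting argument above; the set size and value of \(\lambda \) are arbitrary.

```python
# Brute-force check of Z = (4 - lambda)^{|U_i|} on a toy candidate parent set.
# Assumption: Delta_itk is taken as |pi_t ∩ pi_k|, following the counting
# argument above (Delta = 0 when pi_k shares no parents with pi_t).
from itertools import chain, combinations

def all_subsets(s):
    """Every subset of s, including the empty set."""
    s = list(s)
    return [set(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

U_i = {0, 1, 2, 3}   # toy candidate parent set
lam = 0.3            # arbitrary value of lambda in [0, 1]

Z = sum((1 - lam) ** len(pi_t & pi_k)
        for pi_t in all_subsets(U_i)
        for pi_k in all_subsets(U_i))

assert abs(Z - (4 - lam) ** len(U_i)) < 1e-9
```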

1.3 Integration of Bayesian model average

We integrate over all values of the parameter \(\lambda \) to obtain the Bayesian model average over all possible structure priors, thus eliminating the need to select a point-estimate for \(\lambda \). We use an uninformative, uniform prior for \(\lambda \), i.e., \(p(\lambda |U_i) = 1\) for \(0 \le \lambda \le 1\).

$$\begin{aligned} P(\pi _i^{(t)}, \pi _i^{(k)} | U_i)&= \int _0^1 P(\pi _i^{(t)}, \pi _i^{(k)} | U_i, \lambda ) p(\lambda | U_i) d\lambda \\&= \int _0^1 \frac{(1-\lambda ) ^{\Delta _{itk}}}{(4 - \lambda )^{|U_i|}} p(\lambda | U_i) d\lambda \\&= \int _0^1 \frac{(1-\lambda ) ^ {\Delta _{itk}}}{(4 - \lambda )^{|U_i|}} d\lambda \\ \end{aligned}$$

Next, we use an identity given by Euler in 1748 [1]. If \(\beta \) is the beta function and \(_2\text {F}_{\!1}\) is the ordinary hypergeometric function, then

$$\begin{aligned} \int _0^1 x^{b-1}(1-x)^{c-b-1}(1-zx)^{-a} dx = \beta (b, c-b) _2\text {F}_{\!1} (a, b; c; z) \quad \text {for } \mathfrak {R}(c) > \mathfrak {R}(b) > 0 . \end{aligned}$$

Let \(x = \lambda ,\,a = |U_i|,\,b = 1,\,c = \Delta _{itk} + 2\), and \(z = 1/4\). Then the condition \(\Delta _{itk} + 2 > 1 > 0\) holds for any \(\Delta _{itk} \ge 0\), which is the valid range for \(\Delta _{itk}\), and:

$$\begin{aligned} \int _0^1\!\! \lambda ^{0}(1-\lambda )^{\Delta _{itk}}(1-\lambda /4)^{-|U_i|} d\lambda&= \beta (1, \Delta _{itk} + 1) _2\text {F}_{\!1} (|U_i|, 1; \Delta _{itk} + 2; 1/4) \\ \int _0^1 \frac{4^{|U_i|} (1-\lambda )^{\Delta _{itk}}}{(4-\lambda )^{|U_i|}} d \lambda&= \left( \frac{0! \Delta _{itk}!}{(\Delta _{itk} + 1)!}\right) {_2}\text {F}_{\!1} (|U_i|, 1; \Delta _{itk} + 2; 1/4) \\ \int _0^1 \frac{(1-\lambda )^{\Delta _{itk}}}{(4-\lambda )^{|U_i|}} d \lambda&= \left( \!\frac{1}{4^{|U_i|} (\Delta _{itk} + 1)} \!\right) {_2}\text {F}_{\!1} (|U_i|, 1; \Delta _{itk} + 2; 1/4) ,\\ \end{aligned}$$

giving the result:

$$\begin{aligned} P(\pi _i^{(t)}, \pi _i^{(k)} | U_i) = \frac{_2\text {F}_{\!1}(|U_i|, 1; \Delta _{itk}+2; 1/4)}{4^{|U_i|} (\Delta _{itk} + 1)} \end{aligned}$$

We only need the solution to this formula for a limited number of combinations of integer values of \(\Delta _{itk}\) and \(|U_i|\), namely \(0 \le \Delta _{itk} \le |U_i| < n\), where \(n\) is the number of variables in the network. Therefore, we pre-compute a lookup table of the necessary values using the hypergeometric function solver in the GNU Scientific Library (see Note 2).
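
As an illustration of that precomputation, the following sketch builds the lookup table and cross-checks one entry against direct numerical integration. The original implementation uses the GNU Scientific Library; this sketch substitutes SciPy's hyp2f1 and quad, and the network size and tolerance are illustrative assumptions.

```python
# Sketch of the lookup-table precomputation, using scipy.special.hyp2f1 as a
# stand-in for the GNU Scientific Library solver mentioned above.
# Assumptions: n (number of variables) and the check tolerance are illustrative.
import numpy as np
from scipy.integrate import quad
from scipy.special import hyp2f1

def prior_entry(delta, u):
    """P(pi_t, pi_k | U_i) = 2F1(|U_i|, 1; Delta+2; 1/4) / (4^|U_i| (Delta+1))."""
    return hyp2f1(u, 1, delta + 2, 0.25) / (4.0 ** u * (delta + 1))

n = 10                     # number of variables in the network (toy value)
table = np.zeros((n, n))   # table[Delta, |U_i|]; cells with Delta > |U_i| stay unused
for u in range(n):
    for delta in range(u + 1):
        table[delta, u] = prior_entry(delta, u)

# Cross-check one entry against direct numerical integration of
#   int_0^1 (1 - lam)^Delta / (4 - lam)^{|U_i|} d lam
delta, u = 2, 5
integral, _ = quad(lambda lam: (1.0 - lam) ** delta / (4.0 - lam) ** u, 0.0, 1.0)
assert abs(integral - table[delta, u]) < 1e-8
```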

Cite this article

Oyen, D., Lane, T. Transfer learning for Bayesian discovery of multiple Bayesian networks. Knowl Inf Syst 43, 1–28 (2015). https://doi.org/10.1007/s10115-014-0775-6
