Transfer learning for Bayesian discovery of multiple Bayesian networks

Abstract

Bayesian network structure learning algorithms with limited data are being used in domains such as systems biology and neuroscience to gain insight into the underlying processes that produce observed data. Learning reliable networks from limited data is difficult; therefore, transfer learning can improve the robustness of learned networks by leveraging data from related tasks. Existing transfer learning algorithms for Bayesian network structure learning give a single maximum a posteriori estimate of network models. Yet, many other models may be equally likely, and so a more informative result is provided by Bayesian structure discovery. Bayesian structure discovery algorithms estimate posterior probabilities of structural features, such as edges. We present transfer learning for Bayesian structure discovery which allows us to explore the shared and unique structural features among related tasks. Efficient computation requires that our transfer learning objective factors into local calculations, which we prove is given by a broad class of transfer biases. Theoretically, we show the efficiency of our approach. Empirically, we show that compared to single-task learning, transfer learning is better able to positively identify true edges. We apply the method to whole-brain neuroimaging data.

Notes

  1. http://www.cs.unm.edu/~doyen/software.

  2. http://www.gnu.org/software/gsl/.

References

  1. Bailey WN (1935) Generalised hypergeometric series. University Press, Cambridge

  2. Beinlich IA, Suermondt HJ, Chavez RM, Cooper GF (1989) The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In: Second European conference on artificial intelligence in medicine, vol 38, pp 247–256

  3. Björklund A, Husfeldt T, Kaski P, Koivisto M (2007) Fourier meets Möbius: fast subset convolution. In: Proceedings of the thirty-ninth annual ACM symposium on theory of computing, ACM, pp 67–74

  4. Buntine W (1991) Theory refinement on Bayesian networks. In: Seventh conference on uncertainty in artificial intelligence, pp 52–60

  5. Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75

  6. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347

  7. Cooper G, Yoo C (1999) Causal discovery from a mixture of experimental and observational data. In: Conference on uncertainty in artificial intelligence, pp 116–125

  8. Friedman N, Koller D (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1):95–125

  9. Grzegorczyk M, Husmeier D (2008) Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Mach Learn 71(2–3):265–305

  10. Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243

  11. Koivisto M (2006) Advances in exact Bayesian structure discovery in Bayesian networks. In: Twenty-second conference annual conference on uncertainty in artificial intelligence, pp 241–248

  12. Koivisto M, Sood K (2004) Exact Bayesian structure discovery in Bayesian networks. J Mach Learn Res 5:549–573

  13. Lauritzen S, Spiegelhalter D (1988) Local computations with probabilities on graphical structures and their application to expert systems. J R Stat Soc B (Methodological) 50:157–224

  14. Luis R, Sucar LE, Morales EF (2010) Inductive transfer for learning Bayesian networks. Mach Learn 79(1–2):227–255

  15. Madigan D, York J, Allard D (1995) Bayesian graphical models for discrete data. Int Stat Rev 63(2):215–232

  16. Niculescu-Mizil A, Caruana R (2007) Inductive transfer for Bayesian network structure learning. In: Eleventh international conference on artificial intelligence and statistics

  17. Niinimaki T, Parviainen P, Koivisto M (2011) Partial order MCMC for structure discovery in Bayesian networks. In: Twenty-seventh annual conference on uncertainty in artificial intelligence, pp 557–564

  18. Oyen D, Lane T (2012) Leveraging domain knowledge in multitask Bayesian network structure learning. In: Twenty-Sixth AAAI conference on artificial intelligence

  19. Parviainen P, Koivisto M (2009) Exact structure discovery in Bayesian networks with less space. In: Twenty-fifth conference on uncertainty in artificial intelligence, pp 436–443

  20. Pearl J, Bareinboim E (2011) Transportability of causal and statistical relations: a formal approach. In: Twenty-fifth national conference on artificial intelligence, pp 247–254

  21. Richardson M, Domingos P (2003) Learning with knowledge from multiple experts. In: ICML, vol 20, pp 624–631

  22. Thrun S (1996) Is learning the n-th thing any easier than learning the first? Adv Neural Inf Process Syst 8:640–646

  23. Tong S, Koller D (2001) Active learning for structure in Bayesian networks. In: International joint conference on artificial intelligence, vol 17, pp 863–869

  24. Werhli A, Husmeier D (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol 6(1)

  25. Zhang J, Zhang C (2010) Multitask Bregman clustering. In: Twenty-fourth national conference on artificial intelligence

Acknowledgments

Thanks to Vincent P. Clark and the Mind Research Network for neuroimaging data. Also, thanks to Eric Eaton and Paul Ruvolo for useful discussions and feedback. Work funded by the Office of Naval Research grant N00014-000-0000.

Author information

Corresponding author

Correspondence to Diane Oyen.

Appendix

1.1 Extended proof of Theorem 

We derive the equality in Theorem  starting from the joint probability of the data and the feature conditioned on an order, as given in Eq. 9,

$$\begin{aligned} P(f^{(t)}, \mathcal {D} | \! \prec ) = \sum _{G^{(t)} \subseteq \! \prec } f(G^{(t)}) P(D^{(t)} | G^{(t)}) \left( \sum _{G \subseteq \! \prec } P(G^{(t)}, G | \! \prec ) \prod _{k \ne t} P(D^{(k)} | G) \right) . \end{aligned}$$

Modularity properties are used to substitute \(P(D|G) = \prod _{i=1}^n P(x_i | \pi _i)\), \(f(G) = \prod _{i=1}^n f_i(\pi _i)\), and \(P(G^{(t)}, G | \! \prec ) = \prod _{i=1}^n P(\pi _i^{(t)}, \pi _i | U_i)\), as follows:

$$\begin{aligned} P(f^{(t)}, \mathcal {D} | \! \prec )&= \sum _{\pi _1^{(t)} \subseteq U_1} \cdots \sum _{\pi _n^{(t)} \subseteq U_n} \prod _{i=1}^n \left[ f_i(\pi _i^{(t)}) P(x_i^{(t)} | \pi _i^{(t)}) \right] \\&\times \left( \sum _{\pi _1 \subseteq U_1} \cdots \sum _{\pi _n \subseteq U_n} \prod _{i=1}^n P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \right) . \end{aligned}$$

The goal is to factor the terms so that the local calculation at each node \(i\) can be performed independently. First, we factor the terms in the structure bias (the part in round parentheses above) as follows:

$$\begin{aligned}&\sum _{\pi _1 \subseteq U_1} \cdots \sum _{\pi _n \subseteq U_n} \prod _{i=1}^n P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i)\\&\quad = \sum _{\pi _1 \subseteq U_1} \cdots \sum _{\pi _{n-1} \subseteq U_{n-1}} \prod _{i=1}^{n-1} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \\&\quad \quad \times \left( \sum _{\pi _{n} \subseteq U_{n}} P(\pi _n^{(t)}, \pi _n | U_n) \prod _{k \ne t} P(x_n^{(k)} | \pi _n) \right) \\&\quad = \left( \sum _{\pi _1 \subseteq U_1} P(\pi _1^{(t)}, \pi _1 | U_1) \prod _{k \ne t} P(x_1^{(k)} | \pi _1) \right) \\&\quad \quad \times \left( \sum _{\pi _2 \subseteq U_2} P(\pi _2^{(t)}, \pi _2 | U_2) \prod _{k \ne t} P(x_2^{(k)} | \pi _2) \right) \times \cdots \\&\quad \quad \times \left( \sum _{\pi _n \subseteq U_n}P(\pi _n^{(t)}, \pi _n | U_n) \prod _{k \ne t} P(x_n^{(k)} | \pi _n) \right) \\&\quad = \prod _{i=1}^n \sum _{\pi _i \subseteq U_i} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i).\\ \end{aligned}$$

Plugging the factored structure prior back into the full equation, we now have:

$$\begin{aligned}&P(f^{(t)}, \mathcal {D} | \! \prec ) \\&\quad = \sum _{\pi _1^{(t)} \subseteq U_1} \!\cdots \! \sum _{\pi _n^{(t)} \subseteq U_n} \prod _{i=1}^n f_i(\pi _i^{(t)}) P(x_i^{(t)} | \pi _i^{(t)}) \left( \sum _{\pi _i \subseteq U_i} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \right) . \end{aligned}$$

Performing a factoring similar to that done above on the structure prior, we achieve the desired result:

$$\begin{aligned} P(f^{(t)}, \mathcal {D} | \! \prec ) = \prod _{i=1}^n \sum _{\pi _i^{(t)} \subseteq U_i} f_i(\pi _i^{(t)}) P(x_i^{(t)} | \pi _i^{(t)}) \left( \sum _{\pi _i \subseteq U_i} P(\pi _i^{(t)}, \pi _i | U_i) \prod _{k \ne t} P(x_i^{(k)} | \pi _i) \right) . \end{aligned}$$
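
This last step relies only on the distributivity of summation over node-local products. The following minimal Python sketch, added here purely as a sanity check (it is not part of the original derivation), verifies numerically on a toy problem that summing the product of per-node terms over all joint choices of parent sets equals the product of the per-node sums; the number of nodes, candidate parent-set counts, and random local scores are illustrative assumptions.

```python
# Sanity check of the factorization: sum over joint parent-set choices of a
# product of node-local terms equals the product of per-node sums.
# Toy sizes and random local scores are illustrative assumptions only.
import itertools
import math
import random

random.seed(0)
n = 4                                       # number of nodes (toy value)
sizes = [3, 2, 4, 3]                        # candidate parent sets per node (toy values)
g = [[random.random() for _ in range(s)] for s in sizes]   # local terms g_i(pi_i)

# Unfactored form: sum over all joint choices of parent sets of the product of local terms
joint = sum(
    math.prod(g[i][c[i]] for i in range(n))
    for c in itertools.product(*(range(s) for s in sizes))
)

# Factored form: product over nodes of the per-node sums
factored = math.prod(sum(gi) for gi in g)

assert abs(joint - factored) < 1e-9
```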

1.2 Normalization constant for structure bias

Calculation of the normalization constant requires summing over all possible combinations of parent sets \((\pi _i^{(t)}, \pi _i^{(k)})\), as follows:

$$\begin{aligned} Z = \sum _{\pi _i^{(t)} \subseteq U_i} \sum _{\pi _i^{(k)} \subseteq U_i} (1-\lambda )^{\Delta _{itk}}. \end{aligned}$$

We can simplify this by first fixing parent set \(\pi _i^{(t)}\) while counting how many parent sets \(\pi _i^{(k)}\) give \(\Delta _{itk} = 0\) (\(\pi _i^{(k)}\) can contain any parents from the set \(\{U_i {\setminus } \pi _i^{(t)}\}\) but none from \(\pi _i^{(t)}\)); then how many give \(\Delta _{itk} = 1\) (\(\pi _i^{(k)}\) can contain any parents from the set \(\{U_i {\setminus } \pi _i^{(t)}\}\) and exactly one from \(\pi _i^{(t)}\)); and so on, up to the maximum of \(\Delta _{itk} = |\pi _i^{(t)}|\). This sum turns out to be a binomial expansion, so we can write it in closed form. The expansion is demonstrated here:

$$\begin{aligned} Z&= \sum _{\pi _i^{(t)} \subseteq U_i} \left[ \sum _{\{\pi _i^{(k)} \subseteq U_i \big | \Delta _{itk} = 0\}}\!\!\!\!\!\!\! 1 + \sum _{\{\pi _i^{(k)} \subseteq U_i \big | \Delta _{itk} = 1\}}\!\!\!\!\!\!\!\!\!\!\!\! (1-\lambda ) + \cdots + \!\!\!\!\!\!\!\!\!\! \sum _{\{\pi _i^{(k)} \subseteq U_i \big | \Delta _{itk} = |\pi _i^{(t)}|\}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! (1-\lambda )^{| \pi _i^{(t)}|} \right] \\&= \sum _{\pi _i^{(t)} \subseteq U_i} \left[ \left( {\begin{array}{c}| \pi _i^{(t)}|\\ 0\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|} + \left( {\begin{array}{c}| \pi _i^{(t)}|\\ 1\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|}(1-\lambda ) + \cdots \right. \\&\left. \quad +\, \left( {\begin{array}{c}| \pi _i^{(t)}|\\ | \pi _i^{(t)}|\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|}(1-\lambda )^{| \pi _i^{(t)}|} \right] \\&= \sum _{\pi _i^{(t)} \subseteq U_i} \sum _{j=0}^{| \pi _i^{(t)}|} \left( {\begin{array}{c}| \pi _i^{(t)}|\\ j\end{array}}\right) 2^{| U_i | - | \pi _i^{(t)}|} (1-\lambda )^j \\&= \sum _{\pi _i^{(t)} \subseteq U_i} 2^{| U_i | - | \pi _i^{(t)}|} \sum _{j=0}^{| \pi _i^{(t)}|} \left( {\begin{array}{c}| \pi _i^{(t)}|\\ j\end{array}}\right) (1-\lambda )^j \\&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} 2^{- | \pi _i^{(t)}|} (1 + 1 - \lambda )^{| \pi _i^{(t)}|}\\&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} \left( \frac{2-\lambda }{2} \right) ^{| \pi _i^{(t)}|} \\&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} \left( 1-\frac{\lambda }{2}\right) ^{| \pi _i^{(t)}|} . \end{aligned}$$

Next, we perform a similar expansion of the sum over parent sets \(\pi _i^{(t)}\) of size \(|\pi _i^{(t)}| = 0\), \(|\pi _i^{(t)}| = 1\), and so on, up to the maximum \(|\pi _i^{(t)}| = |U_i|\). This sum also turns out to be a binomial expansion and can therefore be simplified into a closed form, as demonstrated:

$$\begin{aligned} Z&= 2^{| U_i |} \sum _{\pi _i^{(t)} \subseteq U_i} \left( 1-\frac{\lambda }{2}\right) ^{| \pi _i^{(t)}|}\\&= 2^{| U_i |} \left[ \sum _{\{ \pi _i^{(t)} \subseteq U_i \big | | \pi _i^{(t)} | = 0\}}\!\!\!\!\!\!\! 1 + \sum _{\{ \pi _i^{(t)} \subseteq U_i \big | | \pi _i^{(t)} | = 1\}}\!\!\!\!\!\!\!\!\!\!\!\! \left( 1 - \frac{\lambda }{2} \right) + \cdots + \!\!\!\!\!\!\!\!\!\!\!\!\!\! \sum _{\{ \pi _i^{(t)} \subseteq U_i \big | | \pi _i^{(t)} | = | U_i |\}}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \left( 1 - \frac{\lambda }{2} \right) ^{| U_i |} \right] \\&= 2^{| U_i |} \left[ \left( {\begin{array}{c}|U_i|\\ 0\end{array}}\right) + \left( {\begin{array}{c}|U_i|\\ 1\end{array}}\right) \left( 1-\frac{\lambda }{2} \right) + \cdots + \left( {\begin{array}{c}|U_i|\\ |U_i|\end{array}}\right) \left( 1 - \frac{\lambda }{2} \right) ^ {|U_i|} \right] \\&= 2^{| U_i |} \sum _{j=0}^{|U_i|} \left( {\begin{array}{c}|U_i|\\ j\end{array}}\right) \left( 1 - \frac{\lambda }{2}\right) ^j\\&= 2^{| U_i |} \left( 2-\frac{\lambda }{2}\right) ^{| U_i |} \\&= (4 - \lambda )^{| U_i |}. \end{aligned}$$
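
The closed form can also be confirmed by brute force on a small example. The sketch below, not part of the original paper, enumerates all pairs \((\pi _i^{(t)}, \pi _i^{(k)})\) over a toy \(U_i\) and compares the double sum against \((4 - \lambda )^{|U_i|}\). Here \(\Delta _{itk}\) is computed as the number of shared parents, read off the counting argument above; the set size and value of \(\lambda \) are arbitrary.

```python
# Brute-force check of Z = (4 - lambda)^{|U_i|} on a toy candidate parent set.
# Assumption: Delta_itk is taken as |pi_t ∩ pi_k|, following the counting
# argument above (Delta = 0 when pi_k shares no parents with pi_t).
from itertools import chain, combinations

def all_subsets(s):
    """Every subset of s, including the empty set."""
    s = list(s)
    return [set(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

U_i = {0, 1, 2, 3}   # toy candidate parent set
lam = 0.3            # arbitrary value of lambda in [0, 1]

Z = sum((1 - lam) ** len(pi_t & pi_k)
        for pi_t in all_subsets(U_i)
        for pi_k in all_subsets(U_i))

assert abs(Z - (4 - lam) ** len(U_i)) < 1e-9
```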

1.3 Integration of Bayesian model average

We integrate over all values of the parameter \(\lambda \) to obtain the Bayesian model average over all possible structure priors, thus eliminating the need to select a point-estimate for \(\lambda \). We use an uninformative, uniform prior for \(\lambda \), i.e., \(p(\lambda |U_i) = 1\) for \(0 \le \lambda \le 1\).

$$\begin{aligned} P(\pi _i^{(t)}, \pi _i^{(k)} | U_i)&= \int _0^1 P(\pi _i^{(t)}, \pi _i^{(k)} | U_i, \lambda ) p(\lambda | U_i) d\lambda \\&= \int _0^1 \frac{(1-\lambda ) ^{\Delta _{itk}}}{(4 - \lambda )^{|U_i|}} p(\lambda | U_i) d\lambda \\&= \int _0^1 \frac{(1-\lambda ) ^ {\Delta _{itk}}}{(4 - \lambda )^{|U_i|}} d\lambda \\ \end{aligned}$$

Next, we use an identity given by Euler in 1748 [1]. If \(\beta \) is the beta function and \(_2\text {F}_{\!1}\) is the ordinary hypergeometric function, then

$$\begin{aligned} \int _0^1 x^{b-1}(1-x)^{c-b-1}(1-zx)^{-a} dx = \beta (b, c-b) _2\text {F}_{\!1} (a, b; c; z) \quad \text {for } \mathfrak {R}(c) > \mathfrak {R}(b) > 0 . \end{aligned}$$

Let \(x = \lambda ,\,a = |U_i|,\,b = 1,\,c = \Delta _{itk} + 2\), and \(z = 1/4\). Then the condition \(\Delta _{itk} + 2 > 1 > 0\) holds for any \(\Delta _{itk} \ge 0\), which is the valid range for \(\Delta _{itk}\), and:

$$\begin{aligned} \int _0^1\!\! \lambda ^{0}(1-\lambda )^{\Delta _{itk}}(1-\lambda /4)^{-|U_i|} d\lambda&= \beta (1, \Delta _{itk} + 1) _2\text {F}_{\!1} (|U_i|, 1; \Delta _{itk} + 2; 1/4) \\ \int _0^1 \frac{4^{|U_i|} (1-\lambda )^{\Delta _{itk}}}{(4-\lambda )^{|U_i|}} d \lambda&= \left( \frac{0! \Delta _{itk}!}{(\Delta _{itk} + 1)!}\right) {_2}\text {F}_{\!1} (|U_i|, 1; \Delta _{itk} + 2; 1/4) \\ \int _0^1 \frac{(1-\lambda )^{\Delta _{itk}}}{(4-\lambda )^{|U_i|}} d \lambda&= \left( \!\frac{1}{4^{|U_i|} (\Delta _{itk} + 1)} \!\right) {_2}\text {F}_{\!1} (|U_i|, 1; \Delta _{itk} + 2; 1/4) ,\\ \end{aligned}$$

giving the result:

$$\begin{aligned} P(\pi _i^{(t)}, \pi _i^{(k)} | U_i) = \frac{_2\text {F}_{\!1}(|U_i|, 1; \Delta _{itk}+2; 1/4)}{4^{|U_i|} (\Delta _{itk} + 1)} \end{aligned}$$

We only need the solution to this formula for a limited number of combinations of integer values of \(\Delta _{itk}\) and \(|U_i|\), namely \(0 \le \Delta _{itk} \le |U_i| < n\), where \(n\) is the number of variables in the network. Therefore, we pre-compute a lookup table of the necessary values using the hypergeometric function solver in the GNU Scientific Library (see Note 2).
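
As an illustration of that precomputation, the following sketch builds the lookup table and cross-checks one entry against direct numerical integration. The original implementation uses the GNU Scientific Library; this sketch substitutes SciPy's hyp2f1 and quad, and the network size and tolerance are illustrative assumptions.

```python
# Sketch of the lookup-table precomputation, using scipy.special.hyp2f1 as a
# stand-in for the GNU Scientific Library solver mentioned above.
# Assumptions: n (number of variables) and the check tolerance are illustrative.
import numpy as np
from scipy.integrate import quad
from scipy.special import hyp2f1

def prior_entry(delta, u):
    """P(pi_t, pi_k | U_i) = 2F1(|U_i|, 1; Delta+2; 1/4) / (4^|U_i| (Delta+1))."""
    return hyp2f1(u, 1, delta + 2, 0.25) / (4.0 ** u * (delta + 1))

n = 10                     # number of variables in the network (toy value)
table = np.zeros((n, n))   # table[Delta, |U_i|]; cells with Delta > |U_i| stay unused
for u in range(n):
    for delta in range(u + 1):
        table[delta, u] = prior_entry(delta, u)

# Cross-check one entry against direct numerical integration of
#   int_0^1 (1 - lam)^Delta / (4 - lam)^{|U_i|} d lam
delta, u = 2, 5
integral, _ = quad(lambda lam: (1.0 - lam) ** delta / (4.0 - lam) ** u, 0.0, 1.0)
assert abs(integral - table[delta, u]) < 1e-8
```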

Cite this article

Oyen, D., Lane, T. Transfer learning for Bayesian discovery of multiple Bayesian networks. Knowl Inf Syst 43, 1–28 (2015). https://doi.org/10.1007/s10115-014-0775-6
