Federated Learning over Harmonized Data Silos

Artificial Intelligence for Personalized Medicine (W3PHAI 2023)

Part of the book series: Studies in Computational Intelligence (SCI, volume 1106)

Abstract

Federated Learning is a distributed machine learning approach that enables geographically distributed data silos to collaboratively learn a joint machine learning model without sharing data. Most existing work operates on unstructured data, such as images or text, or on structured data assumed to be consistent across the different silos. However, silos often have different schemata, data formats, data values, and access patterns. The field of data integration has developed many methods to address these challenges, including entity linkage and techniques for data exchange and query rewriting using declarative schema mappings. We propose an architectural vision for an end-to-end Federated Learning and Integration system, incorporating the critical steps of data harmonization and data imputation, to spur further research at the intersection of data management information systems and machine learning.
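To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the chapter's system. It assumes two hypothetical silos with different column names and units, a hand-written mapping table standing in for declarative schema mappings, silo-local mean imputation standing in for a real imputation method, and a sample-weighted average of model weights in the style of federated averaging. All names (`MAPPINGS`, `harmonize`, `impute_mean`, `local_update`, `federated_average`) and all data values are illustrative assumptions.

```python
from statistics import fmean

# Hypothetical schema mappings: source column -> (target column, value converter).
MAPPINGS = {
    "silo_a": {"age_years": ("age", float), "weight_kg": ("weight_kg", float)},
    "silo_b": {"age": ("age", float),
               "weight_lb": ("weight_kg", lambda v: float(v) * 0.45359237)},
}


def harmonize(silo, rows):
    """Rewrite a silo's local records into the shared target schema."""
    harmonized = []
    for row in rows:
        record = {}
        for src, (dst, convert) in MAPPINGS[silo].items():
            value = row.get(src)
            record[dst] = None if value is None else convert(value)
        harmonized.append(record)
    return harmonized


def impute_mean(rows, columns):
    """Fill missing values with the silo-local column mean (a simple stand-in)."""
    for col in columns:
        mean = fmean(r[col] for r in rows if r[col] is not None)
        for r in rows:
            if r[col] is None:
                r[col] = mean
    return rows


def local_update(weights, rows, lr=1e-4, epochs=50):
    """Gradient descent on weight_kg ~ w0 + w1 * age, run inside one silo."""
    w0, w1 = weights
    n = len(rows)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for r in rows:
            err = w0 + w1 * r["age"] - r["weight_kg"]
            g0 += err / n
            g1 += err * r["age"] / n
        w0 -= lr * g0
        w1 -= lr * g1
    return (w0, w1), n


def federated_average(updates):
    """Average silo models, weighting each by its local sample count."""
    total = sum(n for _, n in updates)
    return tuple(sum(w[i] * n for w, n in updates) / total for i in range(2))


# Made-up records; only model weights, never rows, leave a silo.
silos = {
    "silo_a": [{"age_years": 30, "weight_kg": 70},
               {"age_years": 50, "weight_kg": None},
               {"age_years": 40, "weight_kg": 80}],
    "silo_b": [{"age": 60, "weight_lb": 176},
               {"age": 20, "weight_lb": 143}],
}
prepared = {name: impute_mean(harmonize(name, rows), ["age", "weight_kg"])
            for name, rows in silos.items()}

global_weights = (0.0, 0.0)
for _ in range(20):  # federation rounds
    updates = [local_update(global_weights, rows) for rows in prepared.values()]
    global_weights = federated_average(updates)
```

Harmonization and imputation run before any training round, so every silo trains on records in the same target schema. In this sketch the coordinator only ever sees model weights and sample counts.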

Acknowledgements

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract HR00112090104, and in part by the National Institutes of Health (NIH) under grant R01DA053028.

Corresponding author

Correspondence to Dimitris Stripelis.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Stripelis, D., Ambite, J.L. (2023). Federated Learning over Harmonized Data Silos. In: Shaban-Nejad, A., Michalowski, M., Bianco, S. (eds) Artificial Intelligence for Personalized Medicine. W3PHAI 2023. Studies in Computational Intelligence, vol 1106. Springer, Cham. https://doi.org/10.1007/978-3-031-36938-4_3
