Abstract
Federated Learning is a distributed machine learning approach that enables geographically distributed data silos to collaboratively learn a joint machine learning model without sharing data. Most existing work operates on unstructured data, such as images or text, or on structured data assumed to be consistent across the different silos. However, silos often have different schemata, data formats, data values, and access patterns. The field of data integration has developed many methods to address these challenges, including techniques for data exchange and query rewriting using declarative schema mappings, and for entity linkage. We propose an architectural vision for an end-to-end Federated Learning and Integration system, incorporating the critical steps of data harmonization and data imputation, to spur further research at the intersection of data management, information systems, and machine learning.
Acknowledgements
This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract HR00112090104, and in part by the National Institutes of Health (NIH) under grant R01DA053028.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Stripelis, D., Ambite, J.L. (2023). Federated Learning over Harmonized Data Silos. In: Shaban-Nejad, A., Michalowski, M., Bianco, S. (eds) Artificial Intelligence for Personalized Medicine. W3PHAI 2023. Studies in Computational Intelligence, vol 1106. Springer, Cham. https://doi.org/10.1007/978-3-031-36938-4_3
DOI: https://doi.org/10.1007/978-3-031-36938-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36937-7
Online ISBN: 978-3-031-36938-4
eBook Packages: Intelligent Technologies and Robotics (R0)