When Machine Learning Models Leak: An Exploration of Synthetic Training Data

  • Conference paper
  • First Online:
Privacy in Statistical Databases (PSD 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13463)

Abstract

We investigate an attack on a machine learning classifier that predicts the propensity of a person or household to move (i.e., relocate) in the next two years. The attack assumes that the classifier has been made publicly available and that the attacker has access to information about a certain number of target individuals. The attacker might also have information about another set of people that can be used to train an auxiliary classifier. We show that the attack is possible for target individuals regardless of whether they were contained in the original training set of the classifier, although it is somewhat less successful for individuals who were not. Based on this observation, we investigate whether training the classifier on a data set synthesized from the original training data, rather than on the original training data directly, would help to mitigate the effectiveness of the attack. Our experimental results show that it does not, leading us to conclude that new approaches to data synthesis must be developed if synthesized data is to resemble “unseen” individuals closely enough to help block machine learning model attacks.

The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands.
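The abstract does not spell out the attack mechanics. As a purely illustrative sketch of the general family of attacks studied here (confidence-based model inversion attribute inference, cf. [1, 11, 12]), and not the authors' exact procedure, the Python fragment below guesses one unknown feature of a target individual by querying the released propensity-to-move classifier with every candidate value and keeping the value that receives the highest confidence for the target's observed label. All names (released_model, known_record, and so on) are hypothetical.

    # Illustrative sketch only: a confidence-based attribute inference attack
    # against a released scikit-learn-style classifier. Not the paper's exact
    # attack; all names are hypothetical.
    import numpy as np

    def invert_sensitive_attribute(released_model, known_record, sensitive_index,
                                   candidate_values, observed_label):
        """Guess the feature at `sensitive_index` for one target individual.

        The attacker knows the target's other features and the label the
        released classifier is expected to output. Each candidate value is
        substituted into the record; the candidate for which the model is most
        confident about the observed label becomes the attacker's guess.
        """
        class_index = list(released_model.classes_).index(observed_label)
        best_value, best_confidence = None, -np.inf
        for value in candidate_values:
            trial = np.array(known_record, dtype=float)
            trial[sensitive_index] = value
            # Black-box access to the released model is enough here.
            confidence = released_model.predict_proba(trial.reshape(1, -1))[0][class_index]
            if confidence > best_confidence:
                best_value, best_confidence = value, confidence
        return best_value

In the auxiliary-classifier variant mentioned in the abstract, the attacker would instead train a second model on people it fully knows, mapping the known attributes together with the released model's confidence scores to the attribute being inferred.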


Notes

  1. https://github.com/Trusted-AI/adversarial-robustness-toolbox.

  2. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.

  3. We note that reducing the number of features does not affect the success rate of the attack, because some variables are redundant: they cover a history reaching back up to 17 years [3].

  4. Random classifier using the stratified strategy from https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html.

  5. Random forest classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. (A sketch showing how these scikit-learn components fit together follows these notes.)
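For readers unfamiliar with the scikit-learn components referenced in notes 2, 4, and 5, the sketch below shows how feature selection with SelectKBest, a stratified dummy baseline, and a random forest classifier are typically wired together. The synthetic stand-in data, the value of k, and the forest size are placeholder choices for illustration, not the settings used in the paper.

    # Hedged sketch of the scikit-learn components named in notes 2, 4, and 5.
    # X and y below are synthetic stand-ins for the register-based features
    # and the binary move/not-move label; all parameter values are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Note 2: keep only the k highest-scoring features (k chosen arbitrarily here).
    selector = SelectKBest(score_func=f_classif, k=20).fit(X_train, y_train)
    X_train_sel = selector.transform(X_train)
    X_test_sel = selector.transform(X_test)

    # Note 4: random baseline that samples predictions from the class distribution.
    baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train_sel, y_train)

    # Note 5: the propensity-to-move classifier itself.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_sel, y_train)

    print("stratified baseline accuracy:", baseline.score(X_test_sel, y_test))
    print("random forest accuracy:", forest.score(X_test_sel, y_test))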

References

  1. Mehnaz, S., Dibbo, S.V., Kabir, E., Li, N., Bertino, E.: Are your sensitive attributes private? Novel model inversion attribute inference attacks on classification models. In: 31st USENIX Security Symposium (USENIX Security), Boston, MA. USENIX Association (2022)

  2. Andreou, A., Goga, O., Loiseau, P.: Identity vs. attribute disclosure risks for users with multiple social profiles. In: Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 163–170. ASONAM (2017)

  3. Burger, J., Buelens, B., de Jong, T., Gootzen, Y.: Replacing a survey question by predictive modeling using register data. In: ISI World Statistics Congress, pp. 1–6 (2019)

  4. Coulter, R., Scott, J.: What motivates residential mobility? Re-examining self-reported reasons for desiring and making residential moves. Popul. Space Place 21(4), 354–371 (2015)

  5. Crull, S.R.: Residential satisfaction, propensity to move, and residential mobility: a causal model. In: Digital Repository at Iowa State University (1979). http://lib.dr.iastate.edu/

  6. De Cristofaro, E.: A critical overview of privacy in machine learning. IEEE Secur. Priv. 19(4), 19–27 (2021)

  7. Domingo-Ferrer, J.: A survey of inference control methods for privacy-preserving data mining. In: Aggarwal, C.C., Yu, P.S. (eds.) Privacy-Preserving Data Mining, pp. 53–80. Springer, Boston (2008). https://doi.org/10.1007/978-0-387-70992-5_3

  8. Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, vol. 201. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0326-5

  9. Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011)

  10. Fackler, D., Rippe, L.: Losing work, moving away? Regional mobility after job loss. Labour 31(4), 457–479 (2017)

  11. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: Proceedings of the 22nd ACM Conference on Computer and Communications Security (SIGSAC), CCS 2015, pp. 1322–1333 (2015)

  12. Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., Ristenpart, T.: Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In: 23rd USENIX Security Symposium (USENIX Security), San Diego, CA, pp. 17–32. USENIX Association (2014)

  13. Heyburn, R., et al.: Machine learning using synthetic and real data: similarity of evaluation metrics for different healthcare datasets and for different algorithms. In: Data Science and Knowledge Engineering for Sensing Decision Support: Proceedings of the 13th International FLINS Conference, pp. 1281–1291. World Scientific (2018)

  14. Hidano, S., Murakami, T., Katsumata, S., Kiyomoto, S., Hanaoka, G.: Model inversion attacks for prediction systems: without knowledge of non-sensitive attributes. In: 15th Annual Conference on Privacy, Security and Trust (PST), pp. 115–11509. IEEE (2017)

  15. Hittmeir, M., Mayer, R., Ekelhart, A.: A baseline for attribute disclosure risk in synthetic data. In: Proceedings of the 10th ACM Conference on Data and Application Security and Privacy, pp. 133–143 (2020)

  16. Hundepool, A., et al.: Statistical Disclosure Control. Wiley, Hoboken (2012)

  17. de Jong, P.A.: Later-life migration in the Netherlands: propensity to move and residential mobility. J. Aging Environ. 36, 1–10 (2020)

  18. Kleinhans, R.: Does social capital affect residents’ propensity to move from restructured neighbourhoods? Hous. Stud. 24(5), 629–651 (2009)

  19. Liu, B., Ding, M., Shaham, S., Rahayu, W., Farokhi, F., Lin, Z.: When machine learning meets privacy: a survey and outlook. ACM Comput. Surv. 54(2), 1–36 (2021)

  20. Reiter, J.P., Mitra, R.: Estimating risks of identification disclosure in partially synthetic data. J. Priv. Confid. 1(1) (2009)

  21. Reiter, J.P., Wang, Q., Zhang, B.: Bayesian estimation of disclosure risks for multiply imputed, synthetic data. J. Priv. Confid. 6(1) (2014)

  22. Rigaki, M., Garcia, S.: A survey of privacy attacks in machine learning. arXiv preprint arXiv:2007.07646 (2020)

  23. Salter, C., Saydjari, O.S., Schneier, B., Wallner, J.: Toward a secure system engineering methodology. In: Proceedings of the 1998 Workshop on New Security Paradigms, pp. 2–10. NSPW (1998)

  24. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: Symposium on Security and Privacy (SP), pp. 3–18. IEEE (2017)

  25. Stadler, T., Oprisanu, B., Troncoso, C.: Synthetic data-anonymisation groundhog day. In: 29th USENIX Security Symposium (USENIX Security). USENIX Association (2020)

  26. Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9

  27. Templ, M.: Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-50272-4

  28. Tramèr, F., Zhang, F., Juels, A., Reiter, M.K., Ristenpart, T.: Stealing machine learning models via prediction APIs. In: 25th USENIX Security Symposium (USENIX Security 2016), Austin, TX, pp. 601–618. USENIX Association (2016)

  29. Yeom, S., Giacomelli, I., Fredrikson, M., Jha, S.: Privacy risk in machine learning: analyzing the connection to overfitting. In: 31st Computer Security Foundations Symposium (CSF), pp. 268–282. IEEE (2018)

Author information

Corresponding author

Correspondence to Manel Slokom.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Slokom, M., de Wolf, PP., Larson, M. (2022). When Machine Learning Models Leak: An Exploration of Synthetic Training Data. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_20

  • DOI: https://doi.org/10.1007/978-3-031-13945-1_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13944-4

  • Online ISBN: 978-3-031-13945-1

  • eBook Packages: Computer Science, Computer Science (R0)
