When Machine Learning Models Leak: An Exploration of Synthetic Training Data

  • Conference paper
  • First Online:
Privacy in Statistical Databases (PSD 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13463)

Abstract

We investigate an attack on a machine learning classifier that predicts the propensity of a person or household to move (i.e., relocate) in the next two years. The attack assumes that the classifier has been made publicly available and that the attacker has access to information about a certain number of target individuals. The attacker might also have information about another set of people that can be used to train an auxiliary classifier. We show that the attack is possible for target individuals regardless of whether they were contained in the original training set of the classifier, although it is somewhat less successful for individuals who were not. Based on this observation, we investigate whether training the classifier on a data set synthesized from the original training data, rather than on the original training data directly, would help to mitigate the effectiveness of the attack. Our experimental results show that it does not, leading us to conclude that new approaches to data synthesis must be developed if synthesized data is to resemble “unseen” individuals closely enough to help block machine learning model attacks.

The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands.
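The abstract does not spell out the attack mechanics. As a purely illustrative sketch of the general family of attacks studied here (confidence-based model inversion attribute inference, cf. [1, 11, 12]), and not the authors' exact procedure, the Python fragment below guesses one unknown feature of a target individual by querying the released propensity-to-move classifier with every candidate value and keeping the value that receives the highest confidence for the target's observed label. All names (released_model, known_record, and so on) are hypothetical.

    # Illustrative sketch only: a confidence-based attribute inference attack
    # against a released scikit-learn-style classifier. Not the paper's exact
    # attack; all names are hypothetical.
    import numpy as np

    def invert_sensitive_attribute(released_model, known_record, sensitive_index,
                                   candidate_values, observed_label):
        """Guess the feature at `sensitive_index` for one target individual.

        The attacker knows the target's other features and the label the
        released classifier is expected to output. Each candidate value is
        substituted into the record; the candidate for which the model is most
        confident about the observed label becomes the attacker's guess.
        """
        class_index = list(released_model.classes_).index(observed_label)
        best_value, best_confidence = None, -np.inf
        for value in candidate_values:
            trial = np.array(known_record, dtype=float)
            trial[sensitive_index] = value
            # Black-box access to the released model is enough here.
            confidence = released_model.predict_proba(trial.reshape(1, -1))[0][class_index]
            if confidence > best_confidence:
                best_value, best_confidence = value, confidence
        return best_value

In the auxiliary-classifier variant mentioned in the abstract, the attacker would instead train a second model on people it fully knows, mapping the known attributes together with the released model's confidence scores to the attribute being inferred.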


Notes

  1. https://github.com/Trusted-AI/adversarial-robustness-toolbox.

  2. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.

  3. We note that reducing the number of features does not affect the success rate of the attack, because some variables are redundant: they cover a history reaching back up to 17 years [3].

  4. Random classifier using the stratified strategy from https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html.

  5. Random forest classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. (A sketch showing how these scikit-learn components fit together follows these notes.)
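For readers unfamiliar with the scikit-learn components referenced in notes 2, 4, and 5, the sketch below shows how feature selection with SelectKBest, a stratified dummy baseline, and a random forest classifier are typically wired together. The synthetic stand-in data, the value of k, and the forest size are placeholder choices for illustration, not the settings used in the paper.

    # Hedged sketch of the scikit-learn components named in notes 2, 4, and 5.
    # X and y below are synthetic stand-ins for the register-based features
    # and the binary move/not-move label; all parameter values are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Note 2: keep only the k highest-scoring features (k chosen arbitrarily here).
    selector = SelectKBest(score_func=f_classif, k=20).fit(X_train, y_train)
    X_train_sel = selector.transform(X_train)
    X_test_sel = selector.transform(X_test)

    # Note 4: random baseline that samples predictions from the class distribution.
    baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train_sel, y_train)

    # Note 5: the propensity-to-move classifier itself.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_sel, y_train)

    print("stratified baseline accuracy:", baseline.score(X_test_sel, y_test))
    print("random forest accuracy:", forest.score(X_test_sel, y_test))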

References

  1. Mehnaz, S., Dibbo, S.V., Kabir, E., Li, N., Bertino, E.: Are your sensitive attributes private? Novel model inversion attribute inference attacks on classification models. In: 31st USENIX Security Symposium (USENIX Security), Boston, MA. USENIX Association (2022)

  2. Andreou, A., Goga, O., Loiseau, P.: Identity vs. attribute disclosure risks for users with multiple social profiles. In: Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 163–170. ASONAM (2017)

  3. Burger, J., Buelens, B., de Jong, T., Gootzen, Y.: Replacing a survey question by predictive modeling using register data. In: ISI World Statistics Congress, pp. 1–6 (2019)

  4. Coulter, R., Scott, J.: What motivates residential mobility? Re-examining self-reported reasons for desiring and making residential moves. Popul. Space Place 21(4), 354–371 (2015)

  5. Crull, S.R.: Residential satisfaction, propensity to move, and residential mobility: a causal model. In: Digital Repository at Iowa State University (1979). http://lib.dr.iastate.edu/

  6. De Cristofaro, E.: A critical overview of privacy in machine learning. IEEE Secur. Priv. 19(4), 19–27 (2021)

  7. Domingo-Ferrer, J.: A survey of inference control methods for privacy-preserving data mining. In: Aggarwal, C.C., Yu, P.S. (eds.) Privacy-Preserving Data Mining, pp. 53–80. Springer, Boston (2008). https://doi.org/10.1007/978-0-387-70992-5_3

  8. Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, vol. 201. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0326-5

  9. Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011)

  10. Fackler, D., Rippe, L.: Losing work, moving away? Regional mobility after job loss. Labour 31(4), 457–479 (2017)

  11. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: Proceedings of the 22nd ACM Conference on Computer and Communications Security (SIGSAC), CCS 2015, pp. 1322–1333 (2015)

  12. Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., Ristenpart, T.: Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In: 23rd USENIX Security Symposium (USENIX Security), San Diego, CA, pp. 17–32. USENIX Association (2014)

  13. Heyburn, R., et al.: Machine learning using synthetic and real data: similarity of evaluation metrics for different healthcare datasets and for different algorithms. In: Data Science and Knowledge Engineering for Sensing Decision Support: Proceedings of the 13th International FLINS Conference, pp. 1281–1291. World Scientific (2018)

  14. Hidano, S., Murakami, T., Katsumata, S., Kiyomoto, S., Hanaoka, G.: Model inversion attacks for prediction systems: without knowledge of non-sensitive attributes. In: 15th Annual Conference on Privacy, Security and Trust (PST), pp. 115–11509. IEEE (2017)

  15. Hittmeir, M., Mayer, R., Ekelhart, A.: A baseline for attribute disclosure risk in synthetic data. In: Proceedings of the 10th ACM Conference on Data and Application Security and Privacy, pp. 133–143 (2020)

  16. Hundepool, A., et al.: Statistical Disclosure Control. Wiley, Hoboken (2012)

  17. de Jong, P.A.: Later-life migration in the Netherlands: propensity to move and residential mobility. J. Aging Environ. 36, 1–10 (2020)

  18. Kleinhans, R.: Does social capital affect residents’ propensity to move from restructured neighbourhoods? Hous. Stud. 24(5), 629–651 (2009)

  19. Liu, B., Ding, M., Shaham, S., Rahayu, W., Farokhi, F., Lin, Z.: When machine learning meets privacy: a survey and outlook. ACM Comput. Surv. 54(2), 1–36 (2021)

  20. Reiter, J.P., Mitra, R.: Estimating risks of identification disclosure in partially synthetic data. J. Priv. Confid. 1(1) (2009)

  21. Reiter, J.P., Wang, Q., Zhang, B.: Bayesian estimation of disclosure risks for multiply imputed, synthetic data. J. Priv. Confid. 6(1) (2014)

  22. Rigaki, M., Garcia, S.: A survey of privacy attacks in machine learning. arXiv preprint arXiv:2007.07646 (2020)

  23. Salter, C., Saydjari, O.S., Schneier, B., Wallner, J.: Toward a secure system engineering methodology. In: Proceedings of the 1998 Workshop on New Security Paradigms, pp. 2–10. NSPW (1998)

  24. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: Symposium on Security and Privacy (SP), pp. 3–18. IEEE (2017)

  25. Stadler, T., Oprisanu, B., Troncoso, C.: Synthetic data-anonymisation groundhog day. In: 29th USENIX Security Symposium (USENIX Security). USENIX Association (2020)

  26. Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9

  27. Templ, M.: Statistical Disclosure Control for Microdata: Methods and Applications in R. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-50272-4

  28. Tramèr, F., Zhang, F., Juels, A., Reiter, M.K., Ristenpart, T.: Stealing machine learning models via prediction APIs. In: 25th USENIX Security Symposium (USENIX Security 2016), Austin, TX, pp. 601–618. USENIX Association (2016)

  29. Yeom, S., Giacomelli, I., Fredrikson, M., Jha, S.: Privacy risk in machine learning: analyzing the connection to overfitting. In: 31st Computer Security Foundations Symposium (CSF), pp. 268–282. IEEE (2018)

Author information

Corresponding author

Correspondence to Manel Slokom.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Slokom, M., de Wolf, PP., Larson, M. (2022). When Machine Learning Models Leak: An Exploration of Synthetic Training Data. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_20

  • DOI: https://doi.org/10.1007/978-3-031-13945-1_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13944-4

  • Online ISBN: 978-3-031-13945-1

  • eBook Packages: Computer Science, Computer Science (R0)
