A Survey on Privacy Preserving Synthetic Data Generation and a Discussion on a Privacy-Utility Trade-off Problem

  • Conference paper
  • In: Science of Cyber Security - SciSec 2022 Workshops (SciSec 2022)

Abstract

Sharing microdata is an important part of the present-day world, but when the data contain sensitive information, the privacy of individuals must be guaranteed before release. One idea is to study the distributional properties of a dataset and generate synthetic data that have similar properties but, unlike the original data, come with a privacy guarantee. In this review paper, we describe in detail some advanced privacy guarantees that need to be verified before such information is released. We also discuss utility metrics that measure the remaining utility of the released data. Very few mechanisms have been developed that ensure utility for synthetic data while maintaining a very strong privacy guarantee. We survey existing methodologies for privacy-preserving synthetic data generation and discuss a privacy-utility trade-off problem.
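To make the general idea concrete, the following is a minimal, hypothetical sketch of histogram-based synthetic data generation under epsilon-differential privacy. It is not a method from the surveyed literature; the function name dp_synthetic_data, the number of bins, and the choice of privacy budget are illustrative assumptions. The empirical distribution of a one-dimensional dataset is estimated, the bin counts are perturbed with Laplace noise of scale 1/epsilon (a histogram query has L1-sensitivity 1), and synthetic records are resampled from the noisy distribution.

```python
import numpy as np


def dp_synthetic_data(data, n_bins=20, epsilon=1.0, n_synth=None, rng=None):
    """Sketch: synthetic data from a Laplace-perturbed histogram.

    Adding or removing one record changes a single bin count by one, so the
    histogram has L1-sensitivity 1 and Laplace noise with scale 1/epsilon
    makes the released counts epsilon-differentially private.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_synth = len(data) if n_synth is None else n_synth

    counts, edges = np.histogram(data, bins=n_bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=n_bins)
    noisy = np.clip(noisy, 0.0, None)          # post-processing preserves DP
    total = noisy.sum()
    probs = noisy / total if total > 0 else np.full(n_bins, 1.0 / n_bins)

    # Draw a bin for each synthetic record, then a uniform value inside it.
    bins = rng.choice(n_bins, size=n_synth, p=probs)
    return rng.uniform(edges[bins], edges[bins + 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(loc=50.0, scale=10.0, size=5000)
    synthetic = dp_synthetic_data(original, epsilon=0.5, rng=rng)
    # Rough utility check: the synthetic sample should track the original.
    print(round(original.mean(), 2), round(synthetic.mean(), 2))
```

The privacy budget epsilon controls the trade-off discussed in the paper: a larger epsilon injects less noise and preserves more distributional utility, but correspondingly weakens the privacy guarantee.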



Acknowledgments

We are grateful to the reviewers of SciSec 2022 for their thorough proof-reading and valuable comments on our paper.

Author information

Corresponding author

Correspondence to Debolina Ghatak.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Ghatak, D., Sakurai, K. (2022). A Survey on Privacy Preserving Synthetic Data Generation and a Discussion on a Privacy-Utility Trade-off Problem. In: Su, C., Sakurai, K. (eds) Science of Cyber Security - SciSec 2022 Workshops. SciSec 2022. Communications in Computer and Information Science, vol 1680. Springer, Singapore. https://doi.org/10.1007/978-981-19-7769-5_13

Download citation

  • DOI: https://doi.org/10.1007/978-981-19-7769-5_13

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-7768-8

  • Online ISBN: 978-981-19-7769-5

  • eBook Packages: Computer Science, Computer Science (R0)
