Abstract
Sharing microdata is an important part of the modern world, but when the data contain sensitive information, the privacy of individuals must be guaranteed before release. One approach is to study the distributional properties of a data set and generate synthetic data that have similar properties but, unlike the original data, come with a privacy guarantee. In this review paper, we describe in detail some advanced privacy guarantees that need to be checked before such information is released. We also discuss some utility metrics for measuring the remaining utility of the released data. Very few mechanisms have been developed that ensure the utility of synthetic data while maintaining a very strong privacy guarantee. We discuss some existing methodologies for privacy-preserving synthetic data generation and discuss a privacy-utility trade-off problem.
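As an illustration of the general idea of learning a distribution and sampling synthetic records under a privacy guarantee, the following is a minimal sketch rather than a method taken from this paper. It assumes a single categorical attribute with a fixed, publicly known domain, uses the standard Laplace mechanism on the histogram of counts, and assumes neighbouring data sets differ by adding or removing one record (so the histogram has L1 sensitivity 1). The function names `laplace_noise` and `dp_synthetic_sample` are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_sample(records, domain, epsilon, n_synthetic, seed=0):
    """Sample synthetic values from an epsilon-DP noisy histogram.

    Assumptions (illustrative only): `domain` is fixed in advance and does
    not depend on the data; neighbouring data sets differ by one added or
    removed record, so each count changes by at most 1 in total.
    """
    rng = random.Random(seed)
    counts = Counter(records)
    scale = 1.0 / epsilon  # sensitivity 1 divided by epsilon

    # Perturb every cell of the domain, including empty ones, then clamp
    # negatives to zero. Clamping is post-processing, so it does not
    # weaken the privacy guarantee.
    noisy = [max(counts.get(v, 0) + laplace_noise(scale, rng), 0.0)
             for v in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform weights

    # Sampling from the noisy histogram is also post-processing.
    return rng.choices(domain, weights=noisy, k=n_synthetic)


if __name__ == "__main__":
    original = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
    synthetic = dp_synthetic_sample(original, ["a", "b", "c", "d"],
                                    epsilon=1.0, n_synthetic=10)
    print(synthetic)
```

A smaller `epsilon` gives a stronger privacy guarantee but adds more noise to each count, so the synthetic sample tracks the original distribution less closely. This is the simplest form of the privacy-utility trade-off the abstract refers to.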
Acknowledgments
We are grateful to the reviewers of SciSec 2022 for their thorough proof-reading and valuable comments on our paper.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ghatak, D., Sakurai, K. (2022). A Survey on Privacy Preserving Synthetic Data Generation and a Discussion on a Privacy-Utility Trade-off Problem. In: Su, C., Sakurai, K. (eds) Science of Cyber Security - SciSec 2022 Workshops. SciSec 2022. Communications in Computer and Information Science, vol 1680. Springer, Singapore. https://doi.org/10.1007/978-981-19-7769-5_13
DOI: https://doi.org/10.1007/978-981-19-7769-5_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7768-8
Online ISBN: 978-981-19-7769-5
eBook Packages: Computer Science, Computer Science (R0)