Generate a synthetic dataset that approximates many randomly chosen marginal distributions of the input dataset.
Team members: Ninghui Li, Zhikun Zhang, Tianhao Wang
Ninghui Li received a Bachelor of Engineering degree in computer science from the University of Science and Technology of China in 1993, and M.S. and Ph.D. degrees in computer science from New York University in 1998 and 2000, respectively. He is currently a professor of computer science at Purdue University. Prior to joining Purdue University in 2003, he was a research associate in the computer science department at Stanford University. His research interests are in security and privacy in information systems. He has published more than 150 technical papers in refereed journals and conference proceedings. He was elected Chair of the ACM Special Interest Group on Security, Audit and Control (SIGSAC) in 2017, and was the Program Chair of the ACM Conference on Computer and Communications Security (CCS) in 2014 and 2015. He is on the editorial boards of ACM Transactions on Privacy and Security (TOPS) (formerly TISSEC), ACM Transactions on Internet Technology, and Journal of Computer Security. He is a senior member of the IEEE and an ACM Distinguished Scientist.
Zhikun Zhang received a Bachelor of Engineering degree in automation in 2014 from Shandong University, Jinan, China. From October 2017 to May 2019, he is a visiting scholar at Purdue University, West Lafayette, IN, USA. He is currently working toward his Ph.D. degree in the Group of Networked Sensing and Control (NeSC) at the State Key Laboratory of Industrial Control Technology, Zhejiang University. His research interests include mechanism design, differential privacy and its applications in marginal release, location privacy, machine learning, and crowdsensing systems.
Tianhao Wang received his Bachelor of Engineering degree from the Software School, Fudan University, in 2015. He is pursuing a Ph.D. degree in computer science at Purdue University. His research interests include differential privacy. He has received multiple student travel grants from top security conferences, and received the 2018 Emil Stefanov Memorial Fellowship.
DPSyns aims to generate a synthetic dataset while satisfying differential privacy. The algorithm builds on our previous work, PriView (published at SIGMOD'14), and has three steps.
Given a dataset as input, the first step generates many randomly selected noisy marginals of the dataset. Each marginal is specified by a subset of the attributes and can be viewed as a projection of the full contingency table onto those attributes. The algorithm decides how many marginals to use and how many attributes each one contains, computes these marginals, and adds noise to them so that differential privacy is satisfied.
In the second step, we use techniques developed in PriView to make all noisy marginals consistent with each other. The techniques presented in PriView were for binary attributes; we have already extended them to categorical attributes. In PriView, it was shown that one can use these noisy marginals to reconstruct arbitrary marginals with high accuracy. This suggests that the noisy marginals capture much of the information in the input dataset and can be used for a broad range of data analysis tasks.
The third step generates a synthetic dataset from these consistent marginals. There are a few alternative methods. One starts with a randomly generated dataset and gradually changes it to be consistent with the noisy marginals. Another uses the noisy marginals to construct probabilistic graphical models of the dataset and then synthesizes data from those models.
The key challenge is efficiency. To preserve information in datasets with dozens or more attributes, we expect to use dozens or more noisy marginals, each including 5-10 attributes. We expect the implemented software tool to be able to generate datasets with millions of records.
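As a rough illustration of the first two steps, the Python sketch below computes two noisy marginals that share one attribute and then makes them agree on that attribute. It is not the team's implementation. The function names (noisy_marginal, make_consistent), the choice of Laplace noise, the even split of the privacy budget, and the unweighted averaging of the shared projection are all assumptions made for this example.

```python
import numpy as np

def noisy_marginal(data, attrs, domain_sizes, epsilon, rng):
    """Contingency table over `attrs` with Laplace noise (assumed mechanism).

    Adding or removing one record changes one cell by 1, so noise of
    scale 1/epsilon is used for this single marginal.
    """
    shape = tuple(domain_sizes[a] for a in attrs)
    counts = np.zeros(shape)
    # np.add.at accumulates correctly when the same cell is indexed repeatedly.
    np.add.at(counts, tuple(data[:, a] for a in attrs), 1)
    return counts + rng.laplace(scale=1.0 / epsilon, size=shape)

def make_consistent(m_a, axis_a, m_b, axis_b):
    """Make two marginals agree on one shared attribute.

    Each marginal is projected onto the shared attribute, the two
    projections are averaged (unweighted, a simplification), and the
    difference is spread evenly over the cells that feed each value.
    """
    def project(m, axis):
        other_axes = tuple(i for i in range(m.ndim) if i != axis)
        return m.sum(axis=other_axes)

    proj_a, proj_b = project(m_a, axis_a), project(m_b, axis_b)
    target = (proj_a + proj_b) / 2.0

    def adjust(m, axis, proj):
        cells_per_value = m.size // m.shape[axis]
        delta = (target - proj) / cells_per_value
        broadcast_shape = [1] * m.ndim
        broadcast_shape[axis] = -1
        return m + delta.reshape(broadcast_shape)

    return adjust(m_a, axis_a, proj_a), adjust(m_b, axis_b, proj_b)

# Hypothetical toy data: 1000 records, three categorical attributes.
rng = np.random.default_rng(0)
domain_sizes = [2, 3, 4]
data = np.column_stack([rng.integers(0, k, size=1000) for k in domain_sizes])

# Two marginals, each given half of an assumed total budget of 1.0.
m01 = noisy_marginal(data, [0, 1], domain_sizes, epsilon=0.5, rng=rng)
m12 = noisy_marginal(data, [1, 2], domain_sizes, epsilon=0.5, rng=rng)

# Attribute 1 is axis 1 of m01 and axis 0 of m12.
c01, c12 = make_consistent(m01, 1, m12, 0)
```

After the adjustment, both marginals yield the same projection onto the shared attribute. Because this step only post-processes the noisy tables, it consumes no additional privacy budget. Handling many overlapping marginals with several shared attributes requires more care than this two-table case.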