Happy Autumn!
For those of you who have been following along with the NIST differentially private synthetic data challenges, I wanted to share with you a very satisfying conclusion to this past year’s hard work.
First off: I believe our final results (below) have some exciting implications for the potential of differentially private synthetic data, even on exceptionally difficult real-world data schemas. But you don’t have to take our word for it. If
you’d like to try these solutions for yourself, and learn how to configure them to run on your own data, register to attend our Demo Day on November 8th:
https://attendee.gotowebinar.com/register/1418248371937151243
Now, a quick recap: the NIST Differential Privacy Temporal Map Challenge started a year ago, in October of 2020. The first sprint focused on privatizing 911 and
police response data from Baltimore. We then tackled synthesizing seven years (and 35 features) of
American Community Survey data, and finally synthesizing millions of location sequence records with the Chicago Taxi data set.
What made this challenging was the focus on time and maps.
The data consisted of temporal sequences, with up to 200 timestamped records per individual in Sprint 3. This meant that even simple counting queries could have quite high sensitivity.
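To make that concrete, here’s a minimal sketch of the standard Laplace mechanism (the counts and ε below are illustrative, not from the challenge): a counting query’s sensitivity is the most any one person can change the answer, so when each person can contribute up to 200 records, the noise scale grows by the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count, max_records_per_person, epsilon):
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    # For a counting query, sensitivity is the most one person can
    # change the answer, i.e. their maximum number of records.
    sensitivity = max_records_per_person
    return true_count + rng.laplace(scale=sensitivity / epsilon)

# One record per person: noise on the order of 1/epsilon.
flat = dp_count(10_000, 1, epsilon=1.0)

# Up to 200 timestamped records per person, as in Sprint 3:
# noise on the order of 200/epsilon for the very same query.
temporal = dp_count(10_000, 200, epsilon=1.0)
```

This is why temporal data is hard: the same query, on the same population, needs roughly 200x more noise once individuals can appear 200 times.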
And the scoring focused on ensuring fair performance for all communities.
Solutions were ranked based on their average score across all map segments (e.g. neighborhoods) in the data. It’s often tempting to focus on dense, simple, monolithic subgroups in the population, which are easier to model and may look nice in aggregate utility
metrics. However, this can result in very poor accuracy for individuals outside those groups. Map-based scoring meant solutions had to maintain good performance on a diverse array of sparse, complex communities.
Final Results--
At the end of our third sprint, we were able to successfully accomplish something we’d been hoping to achieve since the beginning of the first challenge back in 2018.
Our contestants outperformed sampling error, providing formal privacy with
better utility than traditional data release. You can read in more detail here:
https://www.drivendata.co/blog/differential-privacy-winners-sprint3/
Intuitively, this makes some sense. Subsampling is used alongside anonymization as an informal approach to privacy protection, ideally giving respondents the ability to deny that their own records were included in the released sample. To strengthen that
claim, a large portion of the data is usually withheld: the American Community Survey, for example, only releases a 40% subsample of its microdata. This subsampling can alter the distribution of the data while still failing to provide formal
protection against reidentification, so it’s not absurd to think there’s potential to do better on both privacy
and accuracy. Differential privacy only requires adding enough randomization to obfuscate one individual’s contribution; it doesn’t need to eliminate 60% of the data.
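As a back-of-the-envelope illustration of that last point (the population size, attribute prevalence, and ε below are all made up for the sketch, not taken from the challenge): estimating a simple count from a 40% subsample can carry more error than publishing the full count with Laplace noise calibrated to a single person’s contribution.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 100_000          # population size (illustrative)
true_count = 15_000  # people with some attribute (illustrative)
epsilon = 1.0        # privacy budget (illustrative)

has_attr = np.zeros(N, dtype=bool)
has_attr[:true_count] = True

def subsample_estimate():
    # Traditional release: keep a 40% sample, scale the count back up.
    keep = rng.random(N) < 0.4
    return (keep & has_attr).sum() / 0.4

def dp_estimate():
    # DP release: keep all the data; a simple count has sensitivity 1,
    # so the Laplace noise scale is just 1/epsilon.
    return true_count + rng.laplace(scale=1 / epsilon)

sub_err = np.mean([abs(subsample_estimate() - true_count) for _ in range(50)])
dp_err = np.mean([abs(dp_estimate() - true_count) for _ in range(50)])
print(f"mean subsampling error: {sub_err:.0f}")
print(f"mean DP error:          {dp_err:.1f}")
```

In this toy setup the sampling error dwarfs the DP noise, because the Laplace scale depends only on one person’s contribution, not on how much data is withheld. (Real queries on real schemas are far harder than a single count, which is exactly what made the contestants’ results notable.)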
Outperforming sampling error on synthetic data isn’t trivial. Our contestants put considerable work into tuning algorithms and deeply understanding the data. We’re proud of what they’ve accomplished, and we’d like to show you how they did it.
Four of the solutions have been posted as open source, and you can find their links here:
https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/current-and-upcoming-prize-challenges/2020-differential
Three of them participated in our summer development contest: Minutemen, DPSyn, and Jim King. These solutions are being refactored with additional documentation and examples and more flexible configuration options, to make them suitable for use
in future research. Join us for our November 8th Demo Day and learn how to try out our competitors’ approaches for yourself! Can you get them tuned for your data?
Register here: https://attendee.gotowebinar.com/register/1418248371937151243
Finally, how does your own synthetic data approach compare on the challenging, real-world data we described above? Can you beat our competitors on
American Community Survey data? Stay tuned! We’re working to make our benchmark problems more accessible, so you can try them out in your own research.
--Christine
Christine Task
Lead Privacy Researcher
Knexus Research Corporation