Hi all,
This is a quick reminder about this coming Monday’s Demo Day for the NIST Differential Privacy Synthetic Data Challenge. We hope you’ll be able to join us! If you can’t, or if you’d simply like a pointer to the recordings, just fill out the short registration form below (even if you’re unable to attend) and we’ll send you one. The registration link will stay live through Tuesday (11/9), so even if you’re seeing this a day late, you can still request the recording. The contestants’ repos are now live as well; see the links below and have fun!
Date/Time: Monday, 11/8, from 11am – 1:30pm ET
Registration Link:
Please register here
Agenda:
11:00 am Welcome -- Gary Howarth, Prize Manager, NIST PSCR
11:10 am Introduction to the challenge -- Christine Task, Challenge Technical Lead, Knexus Research
11:25 am Benchmark problems for synthetic data -- Nicolas Grislain, CSO, Sarus Technologies
11:30 am Demo 1: Minutemen [2nd Place Winner]
12:00 pm Demo 2: DPSyn [3rd Place Winner]
12:30 pm Demo 3: Jim King [4th Place Winner]
1:00 pm Open Problems Discussion
1:30 pm Conclusion
Topics:
Challenge overview; tutorials on accessing and using the open-sourced winning solutions, so you can experiment with them yourself and try them in your own work or research; and a preview of public benchmark problems and future research. For more background on the challenge, see the previous OpenDP list email with the subject “victory vs sampling error”.
The Tools/Code:
The NIST Differential Privacy Challenge Website now lists links to the code repositories for all three open-sourced synthetic data generators. There you will find executables, source code, quickstart guides with example data, and each team’s technical point of contact (for any questions). Our teams spent the summer making sure their data generators would be well documented, fully configurable, and fun to tinker with. We’ll give you the formal tour on Monday, but you’re welcome to look ahead.
Audience Participation:
We’re planning to take the last half hour to invite a discussion on a general question of interest: why do these solutions perform so well? We’ll review a few tricks that contestants found reliably helpful across all 6 sprints (covering event data, demographic data, and GPS data) over 4 years of challenges.
For some of these tricks we can see empirically that they work in diverse contexts, but we lack formal analysis: what properties of human data sets enable these techniques, and how can we use them to prove tighter, more realistic utility bounds on algorithms going forward? We plan to summarize our current findings and outstanding questions in a white paper to support future research, and we’d appreciate your help.
On Monday we’ll discuss four things: subsampling/weighting to reduce sensitivity, use of hard public constraints (identifying empty sections of the data space), use of soft public constraints (identifying sparse sections of the data space), and heavy pruning of marginal queries (or vertically partitioning histograms). We’ll provide definitions and quick illustrations of what we think are the interesting bits of each technique. But these are just our own observations, and as great as the challenges have been, we’re fully aware that we don’t have a monopoly on high-utility DP. So let’s make this a potluck: what’s the one weird trick you’ve observed that improves performance in your own work on real-world data? Come with your own explanatory paragraph, citation/reference, technical report/arXiv write-up, or just a captivating anecdote (even if you can’t attend), and we’ll add them to our collection.
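To make the first of those tricks concrete before Monday, here is a minimal, hypothetical Python sketch of subsampling-and-reweighting for a differentially private count. It is not any team’s actual code -- the function names, sample rate, and use of Laplace noise via a difference of exponentials are all our own illustrative choices -- but it shows the basic shape: subsample at rate q, run a standard noisy count, and rescale by 1/q, which buys privacy amplification of roughly log(1 + q(e^eps - 1)).

```python
import math
import random

def amplified_epsilon(epsilon, q):
    """Privacy amplification by Poisson subsampling: running an epsilon-DP
    mechanism on a rate-q subsample is log(1 + q*(e^eps - 1))-DP overall."""
    return math.log(1.0 + q * (math.exp(epsilon) - 1.0))

def dp_count(records, predicate, epsilon, sample_rate, rng=None):
    """Noisy count of records matching `predicate`, computed on a Poisson
    subsample and reweighted by 1/sample_rate so the estimate stays unbiased."""
    rng = rng or random.Random()
    # Keep each record independently with probability sample_rate.
    sample = [r for r in records if rng.random() < sample_rate]
    raw = sum(1 for r in sample if predicate(r))
    # Laplace(1/epsilon) noise, drawn as the difference of two exponentials.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return (raw + noise) / sample_rate
```

For example, amplified_epsilon(1.0, 0.1) comes out to roughly 0.16, i.e. a 10% subsample spends only about a sixth of the budget of running the same mechanism on the full data; the reweighting step trades that savings for extra sampling variance. Monday’s demos will show how the winning teams actually applied this idea.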
Christine Task
Technical Lead, NIST Differential Privacy Synthetic Data Challenges
Lead Privacy Researcher, Knexus Research Corporation
Christi...@knexusresearch.com | https://knexusresearch.com/privacy/