New NIST Synthetic Data Program -- A New Kind of Challenge


Christine Task

Mar 2, 2023, 9:00:38 AM
to opendp-c...@g.harvard.edu, Howarth, Gary S. (Fed)

Hello OpenDP Community!

In 2018 and 2020 NIST announced Synthetic Data Challenges here. Now we have something new for you-- like a challenge, but collaborative rather than competitive.  The Collaborative Research Cycle (CRC) aims to more formally understand data deidentification as a whole (synthetic data methods and others). We provide the data and metrology, the community provides deidentified data, and we all learn together. 

 

Register here for our newsletter, and to see our kick-off webinar on 3/7/23.

 

Our premise is that the most interesting research problems happen in cycles: first the idea, which many of you already have; then the engineering to implement that idea on a real-world use case; then the engagement with real-world experts; and finally the moment when we look more closely at the problem and realize, "hey, something's weird here."


And then new things happen.  Research, engineering, and engagement lead to better research.

 

You can learn more about the project here, or check out our project website---but the gist is:

  • We have curated data from diverse, real-world communities.
  • We have an extensive library of evaluation/visualization metrics, sourced from data experts around the world, that you can run at home.
  • We want you to run your privacy technique (or any privacy technique) on our data.
  • Tell us what you did, send us your deidentified data, and we'll send you a very pretty evaluation report. (A minimal sketch of this end-to-end loop follows the list.)
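To make that loop concrete, here is a minimal sketch in Python. Everything in it is a stand-in rather than CRC tooling: the file names are hypothetical, the synthesizer is a toy Laplace-noised full histogram that assumes an all-categorical table, and the 2-way-marginal score is a rough fidelity check, not our metric library. Real submissions follow the instructions on the project website.

    # Illustrative sketch only: hypothetical file names, a toy full-histogram
    # mechanism (assumes every column is categorical), and a rough stand-in
    # fidelity score instead of the CRC evaluation suite.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    def toy_dp_histogram_synthesizer(df: pd.DataFrame, epsilon: float) -> pd.DataFrame:
        """Add Laplace noise to the full contingency table, then resample rows."""
        counts = df.groupby(list(df.columns)).size()
        noisy = (counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))).clip(lower=0)
        probs = noisy / noisy.sum()
        picks = rng.choice(len(probs), size=len(df), p=probs.to_numpy())
        return pd.DataFrame(list(probs.index[picks]), columns=df.columns)

    def two_way_marginal_score(real: pd.DataFrame, synth: pd.DataFrame, cols) -> float:
        """1 minus total variation distance on one 2-way marginal (1.0 = identical)."""
        p = real.groupby(cols).size() / len(real)
        q = synth.groupby(cols).size() / len(synth)
        p, q = p.align(q, fill_value=0.0)
        return 1.0 - 0.5 * float((p - q).abs().sum())

    target = pd.read_csv("diverse_communities_excerpt.csv")   # hypothetical file name
    synthetic = toy_dp_histogram_synthesizer(target, epsilon=10.0)
    print(two_way_marginal_score(target, synthetic, list(target.columns[:2])))
    synthetic.to_csv("my_team_submission.csv", index=False)   # submitted per website instructions

The point of the sketch is just the shape of the loop: deidentify, sanity-check utility locally, then send us the deidentified sample along with a description of what you did.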


Because, then:

  • We will compile what everyone has sent us--data, techniques, and metric results--and release it all publicly to support "hey, this is weird" research into the ways different techniques behave on diverse, real-world data.
  • We will issue a call for a "Tiny Paper Track" so you can submit your observations on the research problems you discover using our resources (early and incremental results are welcome!).
  • We will hold an "Explanatory Workshop" and release proceedings to share what we've all, collaboratively, learned about the behavior of privacy algorithms on diverse communities.

We're hoping for two things from this program. First, some really fun math and data research problems (which we're already seeing). Second, we hope to accelerate the sort of robust, formal understanding of privacy systems that's necessary to ensure we can deploy them safely, without unexpected negative consequences.

If you'd like to follow along with us as we do all this, you can subscribe to our newsletter. And if you think you might like to participate (either submitting deidentified data samples, or joining in the collaborative research efforts), just register a team.

We already have some great data deidentification techniques in our collection.  Do you have one you'd like to add?  Register your team, and follow the website directions to submit it!

Christine Task
Lead Privacy Researcher
Knexus Research Corporation
Christi...@knexusresearch.com

Gary Howarth
NIST Scientist, Program Officer
National Institute of Standards and Technology 

Gary.h...@nist.gov


Christine Task

Mar 2, 2023, 2:53:51 PM
to opendp-c...@g.harvard.edu, Howarth, Gary S. (Fed)

Hello all--

One minor adjustment to the previous email (below):  If you’d like to sign up to the NIST CRC listserv to follow along with our project news and updates, just use this mailto link:

Join CRC listserv for news and updates (send an empty email to subscribe)

Thanks!

--Christine

Christine Task

Apr 12, 2023, 8:19:22 PM
to opendp-c...@g.harvard.edu, Howarth, Gary S. (Fed)

Hello all, 

 

The National Institute of Standards and Technology CRC program, announced here last month, is well under way, using an innovative suite of metrics to collaboratively evaluate and visualize the behaviors of different data deidentification techniques on diverse data. On the off chance you'd like to collaboratively evaluate it too, we're sharing preliminary results of interest to the community--what really is privacy and utility at epsilon 10?

It turns out that can vary significantly, depending on which DP technique you're using. So far our participants have looked at histograms, marginal methods, GAN and transformer networks, constraint satisfaction methods, and even a genetic algorithm. A nice colorful meta-report comparing these techniques at epsilon 10 is now available on the OpenDP Slack (in the "#crc-office-hours" channel), and we'll be available there too, to answer questions and chat. Our thanks to OpenDP for hosting our office hours discussions! The Slack link is a simple click-through to check things out--no need to have previously been a member of the channel.
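For a bit of intuition about why "epsilon 10" is not one fixed thing, here is a back-of-the-envelope sketch. All workload sizes below are illustrative, and it uses plain Laplace noise with simple sequential composition; actual submissions use far more sophisticated mechanisms and accounting. The point is only that the same total budget spread over different query workloads gives very different per-cell noise.

    # How a single epsilon = 10 budget plays out across different (illustrative)
    # query workloads, using plain Laplace noise and sequential composition.
    epsilon = 10.0

    workloads = {
        "one query over the full contingency table": 1,
        "a hand-picked set of 30 marginals": 30,
        "all 2-way marginals of 20 columns (190 queries)": 190,
    }

    for name, k in workloads.items():
        # Each counting query has L1 sensitivity 1, so with a budget of
        # epsilon / k per query the Laplace noise scale per cell is k / epsilon.
        per_query_eps = epsilon / k
        print(f"{name}: per-query epsilon {per_query_eps:.3f}, "
              f"Laplace noise scale {k / epsilon:.1f}")

And that is before getting to GAN, transformer, constraint satisfaction, or genetic approaches, which spend the budget in entirely different ways--one reason results at the same nominal epsilon can look very different.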

If you’d like to learn more about the techniques currently in our collection, see the CRC website. To help contribute new ones (or new samples of existing techniques, with different configs), check our participant instructions. And to follow along with future updates like the one above, you can join our listserv (just send an empty email).

Next month we will be issuing a Call for Papers, soliciting bite-sized (3 pages plus abstract) workshop papers that identify, explore, and start to analyze some of the fundamental patterns we're seeing across different algorithms as they attempt to deidentify our diverse communities' data. The submission deadline will be 9/29, and we'll be holding discussions periodically through the summer.

The epsilon 10 report already contains interesting observations, some of which may have implications for your own research if you're working in the data deidentification space (and are concerned about performance on diverse populations). If you can take a moment to drop into the Slack and look at our reports, we'd love to have your thoughts.

Feel free to direct any questions to Gary Howarth (NIST) or Christine Task (Knexus Research).


Christine Task

Sep 20, 2023, 12:20:13 PM
to opendp-c...@g.harvard.edu, Howarth, Gary S. (Fed)

Hi all--

What’s your favorite recent privacy research on tabular data (e.g., census data, federal agency data--columns of information on individuals)? How do your contributions compare to other people's?

Want to find out?

This December, NIST is holding a (virtual) debutante ball for our Tabular Benchmark Data, and you're all invited. The Diverse Communities Excerpts Data is designed, by experts familiar with both requirements, to be very challenging but tractable. It's curated from the 2018-2019 American Community Survey, with features and demographically diverse geographies that showcase complex distributions over a manageably small schema. Over the summer we've built tools to make it fun and easy to work with, and we've collected over 450 deidentified samples of this data from a variety of privacy techniques, research groups, and stakeholders.

And we've learned a lot already. Grounding diverse research on common benchmark data enables us to efficiently compare, combine, and draw implications across observations from very different groups; we're accelerating the natural Collaborative Research Cycle.


We'd like to include your research in our work. From now until Nov 7th we're accepting non-archival 4-page Research Report Submissions (Call for Papers) that apply existing or new research to the Diverse Communities Excerpts benchmark data.



  • Submissions are welcome to take existing, previously published research and simply provide a new evaluation section related to the Diverse Communities Excerpts benchmark data.
  • Any work in privacy on tabular data is welcome and useful; this invitation is not limited to synthetic data. Privacy-preserving ML, query systems, confidence interval computation, reconstruction attacks, etc. are all relevant. The same quirks of the data distribution will impact many different privacy applications, and robustly understanding those is our ultimate goal.
  • Of course, we always encourage new contributions to our ever-growing archive of deidentification techniques.
  • Also welcome are research report submissions that perform analysis or meta-analysis of the current contents of the deidentified data sample archive (which we've made especially fun and appropriate for students). If your research topic is analysis of deidentified data, we have an awful lot of that for you to explore.
  • Many of us will be gathering in Boston next week for TPDP 2023 and the OpenDP Community Meeting. Do you have a poster in TPDP? Try your work out on our data and get another poster at our virtual workshop this December, and, more importantly, get included in our proceedings.


As always, we will use submissions to motivate analysis of opportunities and roadblocks in this research area, and support future programs designed to address them. We welcome you to contribute your perspective. We anticipate this work will result in an improved understanding of data privacy tools and a more comprehensive view of where that understanding is lacking, allowing us to identify new open research problems. 

If you have any questions or concerns, or would like to chat or get a tour of our resources, please feel free to reach out to Gary and me!
