ZetaSQL - beginner questions

Mihail-Iulian Pleșa

May 2, 2023, 6:49:41 AM
to DP Open Source Users
Hi all, 

I am new to DP and am currently trying to understand DP on SQL queries. I have a few questions regarding the DP command-line interface for ZetaSQL:

1. How do I choose epsilon, delta and kappa? I know that to choose the right epsilon we must compute what is called "sensitivity", but I don't see that in the examples from the repo.
2. What algorithms do you use to compute DP average or DP counts?  
3. What type of DP does the library use: local differential privacy or global differential privacy? I know the definitions but I don't know how to fit different DP projects into the two classes. This will help me understand better.
4. This is the most important one for me, since I didn't find any clear resources on it. Take the restaurant example: count visits by hour of the day. Suppose that I have an initial database, I run the DP SQL query and I release the results. After a while, I add more data to the original database (this is a common scenario in industry). I run the same DP SQL query again and I release the results. How does this impact data privacy?

Thank you very much and I apologize if the questions are a little silly (I'm still learning what it's all about).

miracb

May 2, 2023, 7:09:47 AM
to DP Open Source Users
Hi Mihail-Iulian,

1) Actually, the noise is computed from the epsilon, delta and sensitivity (determined by kappa and the clamping bounds) that you choose. Epsilon and delta are the privacy parameters; the others only affect how your data is clamped and, as a result, the accuracy of the output. That is, for a fixed epsilon and delta, changing kappa or the clamping bounds does not change the privacy guarantees. Here are some pointers for choosing epsilon & delta. The choice of kappa depends on the shape of the dataset: aim for a value that keeps the contributions of most users and cuts off only the outliers. E.g., you can pick kappa so that 90-95% of the users contribute to at most that many partitions.
2) ZetaSQL DP uses the C++ Building Blocks Library under the hood, so the algorithms are the ones in the Building Blocks Library: https://github.com/google/differential-privacy/tree/main/cc/algorithms
3) It uses central (global) differential privacy.
4) That would mean you spent the privacy budget twice for the "old" users in your dataset, and this could be leveraged for differencing attacks. To prevent this, you could run the DP SQL query on only the new users whenever you want to include them, and merge the results with the DP results from the old users. This would keep your privacy budget the same and prevent differencing attacks, at the cost of a noisier result (because noise has now been added to the statistics twice).
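
To make point 4 concrete, here is a minimal Python sketch of the budget arithmetic. It is not the library's code: the helper names are made up for illustration, it uses plain Laplace noise with sensitivity 1 (assuming each user adds at most 1 to a given count), it uses basic sequential composition, and it assumes the newly added rows belong to users who were not in the first release.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale is Laplace(0, scale).
    # Illustrative only: not a secure noise source.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    # Laplace mechanism for one count (hypothetical helper, not a library call).
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def epsilon_spent_per_user(releases):
    # Basic sequential composition: a user pays the epsilon of every
    # release whose input contained that user's data.
    spent = {}
    for users, epsilon in releases:
        for user in users:
            spent[user] = spent.get(user, 0.0) + epsilon
    return spent


def laplace_variance(epsilon, sensitivity=1.0):
    # Variance of Laplace noise with scale b = sensitivity / epsilon is 2 * b^2.
    return 2.0 * (sensitivity / epsilon) ** 2


old_users = {"u1", "u2", "u3"}
new_users = {"u4", "u5"}
epsilon = 1.0

# Option A: re-run the same query on the whole, grown database.
rerun = epsilon_spent_per_user(
    [(old_users, epsilon), (old_users | new_users, epsilon)]
)

# Option B: query only the new users and add the result to the old release.
merge = epsilon_spent_per_user([(old_users, epsilon), (new_users, epsilon)])

rng = random.Random(0)
merged_count = dp_count(len(old_users), epsilon, rng) + dp_count(
    len(new_users), epsilon, rng
)

# Option B carries two independent noise terms, so its variance doubles.
single_variance = laplace_variance(epsilon)
merged_variance = 2 * laplace_variance(epsilon)

print("epsilon spent, re-run:", rerun)
print("epsilon spent, merge: ", merge)
print("variance single vs merged:", single_variance, merged_variance)
```

Under these assumptions, option A charges every old user 2 * epsilon, while option B charges every user epsilon once and doubles the noise variance of the merged count. If old users can also add new rows, the two inputs are no longer disjoint and this accounting does not apply as written.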


I hope that helps!

Mihail-Iulian Pleșa

May 3, 2023, 7:00:34 AM
to miracb, DP Open Source Users
Hi,

Thank you for your answer. 

I read the paper and I don't understand whether the library works if a user has more than one record. What puzzles me is that at the beginning of Section 3.2 (Bounded-contribution aggregation) the authors state: "Importantly, at this step, we assume that each user's contributions have been aggregated to a single input row - this property is enforced by the query rewriter".
So what, in the end, is the privacy unit?

The paper: 
Thank you!

--
Respectfully,
Mihail Plesa

Miraç Vuslat Başaran

May 3, 2023, 7:00:38 AM
to Mihail-Iulian Pleșa, DP Open Source Users
Hi,

ZetaSQL DP does work when a user has more than one record (and so do Privacy on Beam and PipelineDP). It does so by asking for the privacy unit, kappa (I believe it is max_groups_contributed in the newer versions) and the clamping bounds. See the documentation on privacy units; you need to set the privacy_unit_column or anonymization_userid_column.
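
As a rough illustration of how the privacy unit, kappa and clamping bounds fit together when a user has several records, here is a small Python sketch. It is not ZetaSQL's implementation: the function names and the toy rows are hypothetical, the noise is plain (non-secure) Laplace noise, and partition selection with delta is left out. It follows the idea quoted from the paper, where each user's rows are first collapsed into one value per partition before anything is released.

```python
import math
import random
from collections import defaultdict


def choose_kappa(rows, percentile=0.95):
    # Smallest k such that at least `percentile` of users touch <= k partitions.
    # Note: computing this exactly on private data is not itself DP.
    per_user = defaultdict(set)
    for user, partition, _value in rows:
        per_user[user].add(partition)
    counts = sorted(len(parts) for parts in per_user.values())
    index = max(0, math.ceil(percentile * len(counts)) - 1)
    return counts[index]


def bound_contributions(rows, kappa, lower, upper, rng):
    # Step 1: collapse each user's rows into one sum per (user, partition).
    per_user = defaultdict(lambda: defaultdict(float))
    for user, partition, value in rows:
        per_user[user][partition] += value
    # Step 2: keep at most kappa randomly chosen partitions per user.
    # Step 3: clamp each per-user sum to [lower, upper].
    bounded = defaultdict(dict)
    for user, sums in per_user.items():
        kept = rng.sample(sorted(sums), min(kappa, len(sums)))
        for partition in kept:
            bounded[partition][user] = min(max(sums[partition], lower), upper)
    return bounded


def noisy_sum(user_values, epsilon, kappa, lower, upper, rng):
    # One user can change at most kappa partitions, each by at most
    # max(|lower|, |upper|), so that product is the L1 sensitivity.
    sensitivity = kappa * max(abs(lower), abs(upper))
    rate = epsilon / sensitivity
    # Difference of two exponentials gives Laplace noise of scale sensitivity/epsilon.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return sum(user_values) + noise


# Toy rows: (privacy unit, partition = hour of day, amount spent).
rows = [
    ("alice", 9, 10.0), ("alice", 9, 5.0), ("alice", 12, 200.0), ("alice", 18, 20.0),
    ("bob", 9, 8.0),
    ("carol", 12, 12.0), ("carol", 18, 7.0),
]

rng = random.Random(0)
kappa, lower, upper, epsilon = 2, 0.0, 50.0, 1.0
bounded = bound_contributions(rows, kappa, lower, upper, rng)
result = {
    hour: noisy_sum(values.values(), epsilon, kappa, lower, upper, rng)
    for hour, values in bounded.items()
}
print(choose_kappa(rows, 0.95), dict(bounded), result)
```

Here alice has four rows across three hours, but after bounding she influences at most two hours, and each by at most 50. That is why changing kappa or the clamping bounds changes the accuracy (more or less data is dropped, more or less noise is added) but not the epsilon guarantee.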

Cheers,
Mirac
