Privacy-proof way to share data set for debugging?

Rolf Blijleven

Aug 3, 2022, 4:39:39 AM

to OpenRefine

Hi all,

I filed a possible bug report on GitHub, and my data set is needed to reproduce the problem. The data contains names and birth dates, and it isn't mine.

How can I share this data without making it public on GitHub?

I could zip the project, put a password on it, upload it to GitHub and email (or better, Signal) the password to the lucky ones who won the debugging task. Is there a better way, or how do you normally do this?

Cheers,

Rolf

Owen Stephens

Aug 3, 2022, 5:16:01 AM

to OpenRefine

I don't think there is a good answer to this, to be honest.

Generally I'd say if the data set is not shareable publicly then there isn't a good way of sharing in the context of public OpenRefine support. Note that while we have a group of people who commonly help debug issues, as a project I don't see that we could collectively take responsibility for the privacy of the data.

If the data isn't publicly shareable then if it were me I'd want some kind of formal agreement in place with any individual or organisation to ensure confidentiality etc. - basically a support contract in this context. Once that's in place then you could agree with that individual/org the best way of sharing with appropriate security.

The reason people request the data set when debugging is that they are trying to recreate the problem and narrow down the possible causes. Working with the data where the problem was originally observed is one way to narrow down the issue - because if it always fails for you with a data set, but works for someone else with the identical data set, we know that the issue lies somewhere other than the data.

So my first step, if I were you, would be to see whether the same issue occurs with another, shareable, dataset. If so, you can then share that dataset as the example. If not, then it seems likely that something about the original data set is causing the problem. In the latter case we get into more difficult territory, because we now know the issue is with some configuration of data, but we don't know how to recreate it. However, that's a worry for if we get there. The first thing to establish is whether you see the same problem with some publicly shareable data, and if so, share that data on the issue.

Best wishes

Owen

Rolf Blijleven

Aug 3, 2022, 6:22:11 AM

to openr...@googlegroups.com

Thanks, Owen.

I posted the question here because that's what's recommended in the bug report template. I'm a bit surprised. Apparently it hasn't come up often enough for there to be a formal solution. I work with people data all the time.

Anyway, not to worry. I'll first try to reproduce it on a different machine, and if I see it there I'll check with my customer whether they have any objections to my sharing the data.

Many thanks!

Rolf

On Wed, Aug 3, 2022 at 11:16, Owen Stephens <ow...@ostephens.com> wrote:

--
You received this message because you are subscribed to a topic in the Google Groups "OpenRefine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/3TP6WTJsGIk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/fe61c463-6844-47a6-896a-ea1ec51c5996n%40googlegroups.com.

Owen Stephens

Aug 3, 2022, 6:30:18 AM

to OpenRefine

On Wednesday, August 3, 2022 at 11:22:11 AM UTC+1 rbl...@gmail.com wrote:

> Thanks, Owen.
>
> I posted the question here because that's what's recommended in the bug report template. I'm a bit surprised. Apparently it hasn't come up often enough for there to be a formal solution. I work with people data all the time.

I hadn't realised the bug report template said this!

My answer should be taken only as a personal response. I don't speak for the project as a whole, and if I've given any incorrect information here, apologies!

Owen

Thad Guidry

Aug 3, 2022, 10:28:22 AM

to openr...@googlegroups.com

The best way to achieve the debugging goals and keep the data private is to create new fake data that still reproduces the issue.

You could probably make a new file with a few fake person strings, dates, etc.?
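One way to get such a file without hand-writing rows is to pseudonymize a copy of the real export so that it keeps its size and shape (row count, string lengths, punctuation, duplicate values). What follows is only a minimal sketch in Python, not an OpenRefine feature. It assumes a regular CSV export with ISO dates (YYYY-MM-DD); the function names, the `SECRET` key and the column names are all hypothetical. Scrambled data should still be reviewed before sharing, since this is pseudonymization, not a guarantee of anonymity.

```python
import csv
import hashlib
import hmac
import io
import random
import string
from datetime import date, timedelta

# Hypothetical key: pick your own and never share it.
SECRET = b"change-me-and-keep-private"


def scramble(value: str) -> str:
    """Replace letters and digits with random ones of the same class.

    The generator is seeded from an HMAC of the value, so equal inputs
    give equal outputs and duplicates in a column stay duplicates.
    Non-ASCII letters become ASCII letters, which may matter if the
    bug depends on specific characters.
    """
    digest = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(digest)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            repl = rng.choice(string.ascii_lowercase)
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)  # keep spaces, hyphens, apostrophes as-is
    return "".join(out)


def shift_date(value: str, days: int) -> str:
    """Shift an ISO date by a fixed offset; scramble anything unparseable."""
    try:
        return (date.fromisoformat(value) + timedelta(days=days)).isoformat()
    except (ValueError, OverflowError):
        return scramble(value)


def pseudonymize(csv_text: str, name_cols, date_cols, days: int = 1234) -> str:
    """Return a copy of csv_text with the given columns pseudonymized."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in name_cols:
            row[col] = scramble(row.get(col) or "")
        for col in date_cols:
            row[col] = shift_date(row.get(col) or "", days)
        writer.writerow(row)
    return out.getvalue()
```

Calling `pseudonymize(text, ["name"], ["born"])` on the exported CSV text would give a file with the same number of rows and the same columns. Whether the issue still reproduces with it has to be checked before attaching it to the bug report.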

Thad

https://www.linkedin.com/in/thadguidry/

https://calendly.com/thadguidry/


Rolf Blijleven

Aug 3, 2022, 2:47:45 PM

to openr...@googlegroups.com

Hi Thad,

I already did that. The problem does not appear with a few lines, but it does with the 5600+ that I have.

I've prepared this data for others to clean up. They'll just have to use version 3.5.2. That works fine.

On the one hand, it would be good to have something in place to share data in a way that is GDPR compliant.

On the other hand, I understand that in the way open source works, it's very hard to have a prompt, adequate response when asked to destroy a person's data. It could be all over the planet.

So I guess I'll leave it like it is, for now.

Cheers,

Rolf

On Wed, Aug 3, 2022 at 16:28, Thad Guidry <thadg...@gmail.com> wrote:

