How to anonymize an id field?


Ofrit Lesser

Jul 2, 2015, 10:38:43 AM
to israel-r-...@googlegroups.com

Hi,

I'd like to anonymize my data before releasing it. Specifically, I'm looking for a way to anonymize the 'id' field in my data.

The id field appears across several CSV files. Here is a toy example of my data (there are two files, x and y):

x <- data.frame(id=c(425, 126), item1=c(3,5))
y <- data.frame(id=c(123, 126, 504, 888), item2=c(3,5, 2,4))

all.id <- c(x$id, y$id)

The 'id' fields of x and y may or may not overlap.

I'd like to anonymize the ids so that the anonymization is consistent across the files (i.e. anonymize all.id) and the original ids are not exposed.

Any ideas for a quick and easy solution?

Thanks,

Ofrit






amit gal

Jul 2, 2015, 10:44:49 AM
to israel-r-...@googlegroups.com

1. Create a vector of the unique ids across all your data.
2. Create a vector of new, anonymized ids of the same length.
3. Name the vector from step 2 using the vector from step 1 (as character strings).
4. Anonymize your data by looking up each old id by name:
x$nid = all.anonimized[as.character(x$id)]


Ofrit Lesser

Jul 2, 2015, 11:00:03 AM
to israel-r-...@googlegroups.com

Thanks Amit.

This is a good idea; however, it does not fully solve my problem.

The problem is that in my case I'll need to apply the anonymization back to the original files.

If we look at my example, both x$id and y$id include id=126.

So we have 5 distinct ids across x and y, but the total number of id entries in the two files is 6.

Any idea how to apply the consistent new ids back to x and y?

Ofrit

amit gal

Jul 2, 2015, 11:15:10 AM
to israel-r-...@googlegroups.com

What I suggested solves your problem. I'm on mobile, so I can't send code now. I will send it later.

Jonathan Rosenblatt

Jul 2, 2015, 1:05:15 PM
to israel-r-user-group
How about hashing the ids?
Say, with git2r::hash()

Jonathan Rosenblatt
www.john-ros.com

amit gal

Jul 2, 2015, 1:39:37 PM
to israel-r-...@googlegroups.com
So my code would look like:

all.ids = unique(c(x$id, y$id))  # add any other data frames you have
anonimized.ids = sample(length(all.ids))  # random, unique new ids: a permutation of 1..n
names(anonimized.ids) = as.character(all.ids)  # name each new id by its original id
# now replace the old ids with the new ones (looked up by name, not by position):
x$id = anonimized.ids[as.character(x$id)]
y$id = anonimized.ids[as.character(y$id)]
# now you can save your anonymized data...

# and you can save, separately, the anonimized.ids vector
# so you can later reverse the process if needed.
# don't publish this vector, though :)
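
With more than two files, the same named-vector lookup can be wrapped in a small function. The following is only a sketch in base R: the helper name anonymize_ids, its id_col argument, and the returned list layout are assumptions made for illustration and are not part of the thread. It builds one lookup over all data frames, applies it to each, and shows how the saved lookup reverses the process.

```r
# Toy data from the original post
x <- data.frame(id = c(425, 126), item1 = c(3, 5))
y <- data.frame(id = c(123, 126, 504, 888), item2 = c(3, 5, 2, 4))

# Hypothetical helper (not from the thread): one shared lookup for all data frames
anonymize_ids <- function(dfs, id_col = "id") {
  # Unique original ids across every data frame
  all_ids <- unique(unlist(lapply(dfs, function(d) d[[id_col]]), use.names = FALSE))
  # Random permutation of 1..n serves as the new ids
  lookup <- sample.int(length(all_ids))
  names(lookup) <- as.character(all_ids)
  # Replace the id column in each data frame by looking up old ids by name
  out <- lapply(dfs, function(d) {
    d[[id_col]] <- unname(lookup[as.character(d[[id_col]])])
    d
  })
  list(data = out, lookup = lookup)
}

res <- anonymize_ids(list(x = x, y = y))

# id 126 appears in both files and receives the same new id in both
same_126 <- res$data$x$id[2] == res$data$y$id[2]

# Reversing the process with the (private) lookup vector
x_original <- as.numeric(names(res$lookup)[match(res$data$x$id, res$lookup)])
```

The lookup vector in res$lookup is the only link back to the original ids, so it should be stored separately from the released files.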

amit gal

Jul 2, 2015, 1:40:54 PM
to israel-r-...@googlegroups.com
I think that with hash() there is a slight chance that two different old ids will map to the same hash code, in which case you lose the 1-1 mapping between original and anonymized ids.
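
Whichever way the new ids are produced (hashing or random sampling), the 1-1 property can be checked directly before the data is released. A minimal sketch in base R follows; the function name is_one_to_one is hypothetical and introduced here only for illustration.

```r
# Hypothetical helper (not from the thread): TRUE if every old id maps to exactly
# one new id and no two old ids share a new id
is_one_to_one <- function(old_ids, new_ids) {
  stopifnot(length(old_ids) == length(new_ids))
  # Collapse repeated (old, new) pairs, e.g. an id that occurs in several files
  pairs <- unique(data.frame(old = old_ids, new = new_ids))
  anyDuplicated(pairs$old) == 0 && anyDuplicated(pairs$new) == 0
}

# A consistent mapping: 126 occurs twice and gets the same new id both times
ok <- is_one_to_one(c(425, 126, 126), c(2, 1, 1))

# A collision: two different old ids end up with the same new id
collision <- is_one_to_one(c(425, 126), c(1, 1))

# An inconsistency: the same old id gets two different new ids
inconsistent <- is_one_to_one(c(126, 126), c(1, 2))
```

Here ok is TRUE, while collision and inconsistent are both FALSE.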



Ofrit Lesser

Jul 4, 2015, 2:29:55 AM
to israel-r-...@googlegroups.com

Many thanks Amit.

This is very useful!

Ofrit Lesser

Jul 4, 2015, 2:37:02 AM
to israel-r-...@googlegroups.com

Jonathan and Amit.

Thank you both for the useful ideas, and for the prompt response :)

I went with the 'names' solution simply because it needs only the 'base' package.
