How to anonymize an id field?

Ofrit Lesser

Jul 2, 2015, 10:38:43 AM

to israel-r-...@googlegroups.com

Hi,

I'd like to anonymize my data before releasing it. Specifically I'm looking for a way to anonymize the 'id' field in my data.

The id field appears across several csv files. Here is a toy example of my data (there are 2 files x and y):

x <- data.frame(id=c(425, 126), item1=c(3,5))
y <- data.frame(id=c(123, 126, 504, 888), item2=c(3,5, 2,4))

all.id <- c(x$id, y$id)

The 'id' fields of both x and y may or may not overlap.

I'd like to anonymize the ids so that the anonymization is consistent across the files (i.e. anonymize all.id) and the original ids are not exposed.

Any ideas for a quick and easy solution?

Thanks,

Ofrit


amit gal

Jul 2, 2015, 10:44:49 AM

to israel-r-...@googlegroups.com

1. Create a vector of the unique ids across all your data.
2. Create a vector of new, anonymized ids of the same length.
3. Name the vector from step 2 using the vector from step 1.
4. Anonymize your data:
x$nid = all.anonimized[as.character(x$id)]

--
You received this message because you are subscribed to the Google Groups "Israel R User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to israel-r-user-g...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ofrit Lesser

Jul 2, 2015, 11:00:03 AM

to israel-r-...@googlegroups.com

Thanks Amit.

This is a good idea; however, it does not fully solve my problem.

The problem is that in my case I need to apply the anonymization back to the original files.

If we look at my example, both x$id and y$id include id=126.

So we have 5 different ids across x and y, but the total number of ids in both files is 6.
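
The counts above can be checked directly on the toy data. A quick sketch using only base R; the names `n.total`, `n.unique` and `shared` are introduced here purely for illustration:

```r
x <- data.frame(id = c(425, 126), item1 = c(3, 5))
y <- data.frame(id = c(123, 126, 504, 888), item2 = c(3, 5, 2, 4))

all.id <- c(x$id, y$id)

n.total  <- length(all.id)          # ids counted with repetition across both files
n.unique <- length(unique(all.id))  # distinct ids
shared   <- intersect(x$id, y$id)   # ids that appear in both files
```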

Any idea how to apply the consistent new names back to x and y?

Ofrit

amit gal

Jul 2, 2015, 11:15:10 AM

to israel-r-...@googlegroups.com

What I suggested does solve your problem. I'm on mobile, so I can't send code right now. I will send it later.

Jonathan Rosenblatt

Jul 2, 2015, 1:05:15 PM

to israel-r-user-group

How about hashing the ids?

Say, with git2r::hash()
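
A minimal sketch of the hashing idea, assuming the git2r package is installed and that `git2r::hash()` returns one SHA-1 string per input string. The helper name `hash.ids` and the `salt` value are hypothetical additions, not part of the suggestion above: with short numeric ids, an unsalted hash could be reversed by hashing every candidate id, so a secret salt is prepended here.

```r
# Hypothetical helper: hash ids together with a secret salt (assumes git2r is installed).
hash.ids <- function(ids, salt) {
  # paste0() turns each id into a salted string; git2r::hash() returns SHA-1 hex digests
  git2r::hash(paste0(salt, ids))
}

x <- data.frame(id = c(425, 126), item1 = c(3, 5))
y <- data.frame(id = c(123, 126, 504, 888), item2 = c(3, 5, 2, 4))

salt <- "replace-with-a-long-secret-string"  # assumption: kept private, never published

# Only run the hashing step when the package is actually available.
if (requireNamespace("git2r", quietly = TRUE)) {
  x$id <- hash.ids(x$id, salt)
  y$id <- hash.ids(y$id, salt)
}
```

Because the same salt is used for both files, the same original id yields the same hash in x and y, so the files can still be joined on the new id.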

Jonathan Rosenblatt
www.john-ros.com

amit gal

Jul 2, 2015, 1:39:37 PM

to israel-r-...@googlegroups.com

My code would look like this:

all.ids = unique(c(x$id, y$id))  # add any other data frames you have
anonimized.ids = sample(length(all.ids))  # random, unique new ids (a permutation of 1..n)
names(anonimized.ids) = as.character(all.ids)  # name each new id by its original id

# Now assign the new ids in place of the old ones.
# unname() drops the original ids, which would otherwise stay attached as names.
x$id = unname(anonimized.ids[as.character(x$id)])
y$id = unname(anonimized.ids[as.character(y$id)])

# Now you can save your anonymized data,
# and you can save, separately, the anonimized.ids vector
# so you can later reverse the process if needed.
# Don't publish this vector, though :)
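
If the mapping vector is kept, the original ids can be recovered from it later. A sketch of that reverse step under the same naming as the code above; `x.anon`, `pos` and `x.orig.id` are hypothetical names, and `set.seed()` is added only to make the example reproducible:

```r
x <- data.frame(id = c(425, 126), item1 = c(3, 5))
y <- data.frame(id = c(123, 126, 504, 888), item2 = c(3, 5, 2, 4))

set.seed(1)  # assumption: only here so the example is reproducible
all.ids <- unique(c(x$id, y$id))
anonimized.ids <- sample(length(all.ids))
names(anonimized.ids) <- as.character(all.ids)

# Anonymize a copy of x, dropping the names so the original ids are not carried along.
x.anon <- x
x.anon$id <- unname(anonimized.ids[as.character(x$id)])

# Reverse lookup: find each anonymized id's position in the mapping,
# then read the original id back from the names of the mapping vector.
pos <- match(x.anon$id, anonimized.ids)
x.orig.id <- as.numeric(names(anonimized.ids)[pos])
```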

amit gal

Jul 2, 2015, 1:40:54 PM

to israel-r-...@googlegroups.com

I think that with hash() there is a slight chance that two different old ids will map to the same hash code, in which case you lose the 1-1 mapping between original and anonymized ids.
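
Whatever function produces the new ids, the 1-1 property is cheap to verify before releasing the data. A sketch in base R; `is.one.to.one` is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: TRUE when old and new ids form a one-to-one mapping.
is.one.to.one <- function(old.ids, new.ids) {
  stopifnot(length(old.ids) == length(new.ids))
  # Keep each distinct (old, new) pair once; repeated rows for the same id are fine.
  pairs <- unique(data.frame(old = as.character(old.ids),
                             new = as.character(new.ids),
                             stringsAsFactors = FALSE))
  # Each old id must map to a single new id, and each new id must come from a single old id.
  !anyDuplicated(pairs$old) && !anyDuplicated(pairs$new)
}

ok        <- is.one.to.one(c(425, 126, 126), c("a", "b", "b"))  # consistent mapping
collision <- is.one.to.one(c(425, 126), c("a", "a"))            # two old ids share a new id
```

Here `ok` is TRUE and `collision` is FALSE, so a hash collision of the kind described above would be caught by this check.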

Ofrit Lesser

Jul 4, 2015, 2:29:55 AM

to israel-r-...@googlegroups.com

Many thanks Amit.

This is very useful!

Ofrit Lesser

Jul 4, 2015, 2:37:02 AM

to israel-r-...@googlegroups.com

Jonathan and Amit.

Thank you both for the useful ideas, and for the prompt responses :)

I went with the 'names' solution only because it needs nothing beyond the 'base' package.
