Hi,
I'd like to anonymize my data before releasing it. Specifically I'm looking for a way to anonymize the 'id' field in my data.
The id field appears across several csv files. Here is a toy example of my data (there are 2 files x and y):
x <- data.frame(id=c(425, 126), item1=c(3,5))
y <- data.frame(id=c(123, 126, 504, 888), item2=c(3,5, 2,4))
all.id <- c(x$id, y$id)
The 'id' fields of both x and y may or may not overlap.
I'd like to anonymize the ids so that the anonymization is consistent across the files (i.e. anonymize all.id), without exposing the original ids.
Any ideas for a quick and easy solution?
Thanks,
Ofrit
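1. Create a vector of the unique ids across all your data.
2. Create a vector of new, anonymized ids of the same length.
3. Name the vector from step 2 using the vector from step 1.
4. Anonymize your data by looking up each original id by name (as a character string, not as a numeric position):
x$nid <- all.anonymized[as.character(x$id)]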
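The four steps above could be sketched in base R roughly as follows, using the toy data from the question. The names unique.id and new.id are made up for illustration, and drawing the new ids as a random permutation of 1..n is only one possible assumption; any scheme that hides the originals would do.

```r
# Toy data from the original question
x <- data.frame(id = c(425, 126), item1 = c(3, 5))
y <- data.frame(id = c(123, 126, 504, 888), item2 = c(3, 5, 2, 4))

# Step 1: unique ids across all files (126 appears in both, so 5 ids, not 6)
unique.id <- unique(c(x$id, y$id))

# Step 2: new ids of the same length (random permutation: an assumption)
new.id <- sample(length(unique.id))

# Step 3: name the new ids by the original ids (names are stored as strings)
all.anonymized <- setNames(new.id, unique.id)

# Step 4: look up by name rather than position, hence as.character()
x$nid <- unname(all.anonymized[as.character(x$id)])
y$nid <- unname(all.anonymized[as.character(y$id)])

# Drop the original ids before releasing the files
x$id <- NULL
y$id <- NULL
```

Because both files are mapped through the same named vector, the shared id 126 gets the same new id in x and in y. The lookup vector all.anonymized still holds the original ids as names, so it should be kept private.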
--
You received this message because you are subscribed to the Google Groups "Israel R User Group" group.
Thanks Amit.
This is a good idea; however, it does not fully solve my problem.
The problem is that in my case I'll need to apply the anonymization back to the original file.
If we look at my example, both x$id and y$id include id=126.
So we have 5 different ids across x and y, but the total number of id entries in both files is 6.
Any idea how to apply the consistent new names back to x and y?
Ofrit
What I suggested solves your problem. I'm on mobile so I can't send code. I'll send it later.
Many thanks Amit.
This is very useful!
Jonathan and Amit.
Thank you both for the useful ideas, and for the prompt response :)
I used the 'names' solution only because it is part of the 'base' package.