i am hoping someone could give advice on a problem. I have two
dataframes and i want to compare them to find the intersect,
subtraction, etc. below is some simple data, my progress so far and
questions. Thanks in advance for any help:
#sample data
id1 <- c('patient1', 'patient2', 'patient3')
id2 <- c('patient1', 'patient2', 'patient4')
coefficient1 <- c(1, 1, 1)
coefficient2 <- c(1, 2, 3)
df1 <- data.frame(id=id1, coefficient=coefficient1);
df2 <- data.frame(id=id2, coefficient=coefficient2);
IdxToDelete <- setdiff(df1$id, df2$id);
#how to find rows from df1 where df1$id is in IdxToDelete?
#i would expect this to return the row where Id='patient3'
IdxToInsert <- setdiff(df2$id, df1$id);
#how to find rows in df2 where df1$id is in IdxToInsert?
#i would expect this to return the row where Id='patient4'
IdxToUpdate <- intersect(df1$id, df2$id);
#this one is more complex. first, how to find rows in df1 or df2
where df1$id is in IdxToUpdate?
#i would expect this to return the rows where Id='patient1' or 'patient2'
#then how do i most efficiently find any records where the coefficient
on these records differs between the two datasets?
Take this example:
vec1 <- c(1,2,3,NA)
vec2 <- c(1,3,NA,NA)
toGet <- vec1 != vec2
then i plan to use this vector to extract records from a dataset:
endProduct <- someDataframe[toGet]
The problem is "toGet" contains NAs:
FALSE, TRUE, NA, NA
Using the above data, I would hope to make it produce:
FALSE, TRUE, TRUE FALSE
What's the right approach to handle the null values? Should I convert
them to zero in the original vectors? If so, what's the best R
replace function?
Thanks again for the help.
coefficient1 <- oldRecords[match(IdxToUpdate, oldRecords$Id),]
coefficient2 <- newRecords[match(IdxToUpdate, newRecords$Id),]
x<-coefficient2$coefficient!=coefficient1$coefficient
toUpdate <- coefficient1[x,]
On Sun, Oct 24, 2010 at 8:05 PM, Karl Broman <kbr...@biostat.wisc.edu> wrote:
> I mostly use the function match() for such things.
>
> Karl
I few weeks ago I emailed this list asking about comparing 2 datasets
based on a column (essentially finding the intersect). I was advised
to use a combination of setDiff() or intersect() to find the Id values
that needed to be updated. I then used match() to subset the
dataframe (see email chain below for code). This works when comparing
a dataframe using 1 column.
I am now trying to figure out how to perform a similar operation using
2 columns of the datasets. Take this example:
#sample data
idVec1 <- c('patient1', 'patient2', 'patient3')
idVec2 <- c('patient4', 'patient5', 'patient12')
idVec3 <- c('patient1', 'patient8', 'patient9')
idVec4 <- c('patient4', 'patient11', 'patient12')
coefficient1 <- c(1, 1, 1)
coefficient2 <- c(1, 2, 3)
oldRecords <- data.frame(id=idVec1, id2=idVec2, coefficient=coefficient1);
newRecords <- data.frame(id=idVec3, id2=idVec4, coefficient=coefficient2);
I'd like to identify all rows from oldRecords where both Id1 and Id2
are shared with df2. This would return the first row of that
dataframe. in the previous email, I was advised to use setDiff() and
match(), along the lines of:
IdxToDelete <- setdiff(oldRecords$id, newRecords$id);
oldRecords[match(IdxToDelete, oldRecords$Id),]
In order to adapt this for 2 column, I tried the following:
compare1 <- oldRecords$id %in% newRecords$id
compare2 <- oldRecords$id2 %in% newRecords$id2
That produces 2 logical vectors. the result is:
> compare1
[1] TRUE FALSE FALSE
> compare2
[1] TRUE FALSE TRUE
I want to find all rows in the dataframe where both compare1 and
compare2 are true. I thought this would have worked:
oldRecords[compare1 && compare2,]
however, that returned all rows. Can these 2 vectors be used to
subset a dataframe? Is there a better solution for this dataframe
comparison than this one?
thanks for the help.
-ben
oldRecords[compare1 & compare2,]
Note the difference:
> compare1 <- c(TRUE, FALSE, FALSE)
> compare2 <- c(TRUE, FALSE, TRUE)
> compare1 && compare2
[1] TRUE
> compare1 & compare2
[1] TRUE FALSE FALSE
karl
-ben
One more question: In this vector comparison, order matters, and I
don't think %in% is accounting for that:
#sample data
idVec1 <- c('patient1', 'patient2', 'patient3')
idVec3 <- c('patient1', 'patient8', 'patient2')
oldRecords <- data.frame(id=idVec1);
newRecords <- data.frame(id=idVec3);
Now we try:
> compare1 <- oldRecords$id %in% newRecords$id
> compare1
[1] TRUE TRUE FALSE
>
When we want TRUE FALSE FALSE. So I tried this:
> compare2 <- oldRecords$id == newRecords$id
Error in Ops.factor(oldRecords$id, newRecords$id) :
level sets of factors are different
> compare2
logical(0)
>
Then tried:
> compare2 <- as.character(oldRecords$id) == as.character(newRecords$id)
> compare2
[1] TRUE FALSE FALSE
Is performing an '==' comparison with as. character() the right approach?
-Ben
oldids <- paste(oldRecords$id, oldRecords$id2, sep=":")
newids <- paste(newRecords$id, newRecords$id2, sep=":")
oldRecords[!is.na(match(oldids, newids)),]
You might also worry about the order of the two, and so also compare to the switched version:
newidsalt <- paste(newRecords$id2, newRecords$id, sep=":")
oldRecords[!is.na(match(oldids, newids)) | !is.na(match(oldids, newidsalt)) , ]
karl
Yes- pasting IDs is better. After I wrote that email I realized that
issue was not order per se, but that Id1 AND Id2 needed to match in
the same row. For example:
Patient1/Patient2
Patient3/Patient4
compared against:
Patient3/Patient4
Patient1/Patient2
should match both rows even though they are out of order. however:
Patient1/Patient2
Patient3/Patient4
and
Patient1/Patient4
Patient3/Patient2
should match zero rows. the above would fail when using '%in%',
unless columns are pasted together.
Thanks again for the help. I think I can make your last email work.
-Ben