sub-setting data frames?

Ben Bimber

unread,

Oct 24, 2010, 9:01:08 PM10/24/10

to mad...@googlegroups.com

hello,

i am hoping someone could give advice on a problem. I have two
dataframes and i want to compare them to find the intersect,
subtraction, etc. below is some simple data, my progress so far and
questions. Thanks in advance for any help:

#sample data
id1 <- c('patient1', 'patient2', 'patient3')
id2 <- c('patient1', 'patient2', 'patient4')
coefficient1 <- c(1, 1, 1)
coefficient2 <- c(1, 2, 3)
df1 <- data.frame(id=id1, coefficient=coefficient1);
df2 <- data.frame(id=id2, coefficient=coefficient2);

IdxToDelete <- setdiff(df1$id, df2$id);
#how to find rows from df1 where df1$id is in IdxToDelete?
#i would expect this to return the row where Id='patient3'

IdxToInsert <- setdiff(df2$id, df1$id);
#how to find rows in df2 where df1$id is in IdxToInsert?
#i would expect this to return the row where Id='patient4'

IdxToUpdate <- intersect(df1$id, df2$id);
#this one is more complex. first, how to find rows in df1 or df2
where df1$id is in IdxToUpdate?
#i would expect this to return the rows where Id='patient1' or 'patient2'
#then how do i most efficiently find any records where the coefficient
on these records differs between the two datasets?

Douglas Bates

unread,

Oct 25, 2010, 11:28:19 AM10/25/10

to mad...@googlegroups.com

You may want to check the help files for "match" and "%in%" as well as
"merge". Phil Spector's book "Data Manipulation in R" is a good
source for answers to such questions.

Ben Bimber

unread,

Oct 25, 2010, 1:39:15 PM10/25/10

to maduser

thanks to both of you for the advice. that solved most of the
problem. I have one more question involving null values:

Take this example:

vec1 <- c(1,2,3,NA)
vec2 <- c(1,3,NA,NA)
toGet <- vec1 != vec2

then i plan to use this vector to extract records from a dataset:
endProduct <- someDataframe[toGet]

The problem is "toGet" contains NAs:
FALSE, TRUE, NA, NA

Using the above data, I would hope to make it produce:
FALSE, TRUE, TRUE FALSE

What's the right approach to handle the null values? Should I convert
them to zero in the original vectors? If so, what's the best R
replace function?

Thanks again for the help.

coefficient1 <- oldRecords[match(IdxToUpdate, oldRecords$Id),]
coefficient2 <- newRecords[match(IdxToUpdate, newRecords$Id),]

x<-coefficient2$coefficient!=coefficient1$coefficient
toUpdate <- coefficient1[x,]

On Sun, Oct 24, 2010 at 8:05 PM, Karl Broman <kbr...@biostat.wisc.edu> wrote:
> I mostly use the function match() for such things.
>
> Karl

Karl Broman

unread,

Oct 25, 2010, 1:42:45 PM10/25/10

to mad...@googlegroups.com

NAs are always a pain. You need to include suitable use of is.na(). Something like the following:

toGet <- (!is.na(vec1) & is.na(vec2)) | (is.na(vec1) & !is.na(vec2)) | (!is.na(vec1) & !is.na(vec2) & vec1 != vec2)

karl

Ben Bimber

unread,

Jan 5, 2011, 11:15:40 AM1/5/11

to mad...@googlegroups.com

Hello,

I few weeks ago I emailed this list asking about comparing 2 datasets
based on a column (essentially finding the intersect). I was advised
to use a combination of setDiff() or intersect() to find the Id values
that needed to be updated. I then used match() to subset the
dataframe (see email chain below for code). This works when comparing
a dataframe using 1 column.

I am now trying to figure out how to perform a similar operation using
2 columns of the datasets. Take this example:

#sample data
idVec1 <- c('patient1', 'patient2', 'patient3')
idVec2 <- c('patient4', 'patient5', 'patient12')
idVec3 <- c('patient1', 'patient8', 'patient9')
idVec4 <- c('patient4', 'patient11', 'patient12')

coefficient1 <- c(1, 1, 1)
coefficient2 <- c(1, 2, 3)

oldRecords <- data.frame(id=idVec1, id2=idVec2, coefficient=coefficient1);
newRecords <- data.frame(id=idVec3, id2=idVec4, coefficient=coefficient2);

I'd like to identify all rows from oldRecords where both Id1 and Id2
are shared with df2. This would return the first row of that
dataframe. in the previous email, I was advised to use setDiff() and
match(), along the lines of:

IdxToDelete <- setdiff(oldRecords$id, newRecords$id);
oldRecords[match(IdxToDelete, oldRecords$Id),]

In order to adapt this for 2 column, I tried the following:
compare1 <- oldRecords$id %in% newRecords$id
compare2 <- oldRecords$id2 %in% newRecords$id2

That produces 2 logical vectors. the result is:

> compare1
[1] TRUE FALSE FALSE
> compare2
[1] TRUE FALSE TRUE

I want to find all rows in the dataframe where both compare1 and
compare2 are true. I thought this would have worked:

oldRecords[compare1 && compare2,]

however, that returned all rows. Can these 2 vectors be used to
subset a dataframe? Is there a better solution for this dataframe
comparison than this one?

thanks for the help.

-ben

Karl Broman

unread,

Jan 5, 2011, 11:18:09 AM1/5/11

to mad...@googlegroups.com, Ben Bimber

Just replace the && with an &.

oldRecords[compare1 & compare2,]

Note the difference:

> compare1 <- c(TRUE, FALSE, FALSE)
> compare2 <- c(TRUE, FALSE, TRUE)
> compare1 && compare2
[1] TRUE
> compare1 & compare2
[1] TRUE FALSE FALSE

karl

Ben Bimber

unread,

Jan 5, 2011, 11:35:26 AM1/5/11

to maduser

too much javacript and perl. thanks.

-ben

Ben Bimber

unread,

Jan 5, 2011, 11:52:32 AM1/5/11

to maduser

Hi Karl,

One more question: In this vector comparison, order matters, and I
don't think %in% is accounting for that:

#sample data
idVec1 <- c('patient1', 'patient2', 'patient3')

idVec3 <- c('patient1', 'patient8', 'patient2')
oldRecords <- data.frame(id=idVec1);
newRecords <- data.frame(id=idVec3);

Now we try:

> compare1 <- oldRecords$id %in% newRecords$id

> compare1
[1] TRUE TRUE FALSE
>

When we want TRUE FALSE FALSE. So I tried this:

> compare2 <- oldRecords$id == newRecords$id
Error in Ops.factor(oldRecords$id, newRecords$id) :
level sets of factors are different
> compare2
logical(0)
>

Then tried:

> compare2 <- as.character(oldRecords$id) == as.character(newRecords$id)

> compare2
[1] TRUE FALSE FALSE

Is performing an '==' comparison with as. character() the right approach?

-Ben

Karl Broman

unread,

Jan 5, 2011, 12:02:28 PM1/5/11

to mad...@googlegroups.com, Ben Bimber

I would paste the two IDs together and then compare them, as follows:

oldids <- paste(oldRecords$id, oldRecords$id2, sep=":")
newids <- paste(newRecords$id, newRecords$id2, sep=":")
oldRecords[!is.na(match(oldids, newids)),]

You might also worry about the order of the two, and so also compare to the switched version:

newidsalt <- paste(newRecords$id2, newRecords$id, sep=":")
oldRecords[!is.na(match(oldids, newids)) | !is.na(match(oldids, newidsalt)) , ]

karl

Ben Bimber

unread,

Jan 5, 2011, 12:09:58 PM1/5/11

to Karl Broman, mad...@googlegroups.com

Hi Karl,

Yes- pasting IDs is better. After I wrote that email I realized that
issue was not order per se, but that Id1 AND Id2 needed to match in
the same row. For example:

Patient1/Patient2
Patient3/Patient4

compared against:

Patient3/Patient4
Patient1/Patient2

should match both rows even though they are out of order. however:

Patient1/Patient2
Patient3/Patient4

and

Patient1/Patient4
Patient3/Patient2

should match zero rows. the above would fail when using '%in%',
unless columns are pasted together.