rbind removing duplicates?

Mark Knecht

Jul 15, 2009, 9:20:54 PM
to Bay Area R Helpers
Hi,
Just wondering if there is a packaged function somewhere that can
do what rbind does for me but automatically removes duplicate rows if
there are any?

I'd like to merge old data with new data, but am concerned that
there may at times be some rows that are repeated. If there are, then
I'd like to remove them, as long as they are exact duplicates.

Currently I merge different portions of trading systems - say longs
with shorts - where the longs and shorts are in different files. I
use code that goes something like this:

NewArray <- rbind(longs, shorts)
# order() to get them shuffled back together in date order
NewArray <- NewArray[order(NewArray$EnDate, NewArray$EnTime), ]
NewArray$tradeNumber <- 1:dim(NewArray)[1]

I can probably write something but it would be nice to use
something already in the libraries.
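
For what it's worth, here is a minimal sketch of the behavior I'm
after, using made-up toy data frames (if I understand the docs, base
R's unique() drops exact duplicate rows from a data frame):

longs  <- data.frame(Date = c(1, 2), PnL = c(10, -5))
shorts <- rbind(longs[2, ], data.frame(Date = 3, PnL = 7))  # repeats one row of longs

combined <- unique(rbind(longs, shorts))  # exact duplicate rows dropped
nrow(combined)  # 3 rows, not 4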

Thanks,
Mark

Daniel Levine

Jul 15, 2009, 9:32:23 PM
to Mark Knecht, Bay Area R Helpers
If the rows have ids, then after rbind() you can use match(), something like:

new_df is the new dataframe
id is the id column

result <- new_df[match(unique(new_df$id), new_df$id), ]

Hope that helps, and hope there are no typos there.
Basically it's getting a unique list of ids, then using match(), which returns a vector of the row numbers of the first appearances of those unique ids.

Then subset the dataframe with those row numbers.
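
A quick toy run of the same idea, with made-up data, just to show the shape of it:

new_df <- data.frame(id = c("a", "b", "a", "c"), x = c(1, 2, 1, 3))
result <- new_df[match(unique(new_df$id), new_df$id), ]
result
#   id x
# 1  a 1
# 2  b 2
# 4  c 3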

Mark Knecht

Jul 15, 2009, 9:42:27 PM
to Daniel Levine, Bay Area R Helpers
Daniel,
Thanks. It may help. I'll look through the examples and see if
anything rings true for me.

There is a unique ID but it's spread across 4 columns:

Entry Date
Entry Time
Exit Date
Exit Time

I suppose I could do something like paste(EntryDate, EntryTime,
ExitDate, ExitTime, sep=" ") and try matching those as unique? I
get a little worried about white space or caps/no caps possibly
causing a problem sometime down the road, but it's unlikely to be a
problem right now. Here is a sketch of that key-building with those
worries handled up front; the data frame and column names (df,
EnDate, EnTime, ExDate, ExTime) are placeholders:
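
# squeeze out whitespace and fold case before pasting the key together
normalize <- function(x) tolower(gsub("[[:space:]]+", "", as.character(x)))
key <- paste(normalize(df$EnDate), normalize(df$EnTime),
             normalize(df$ExDate), normalize(df$ExTime), sep = " ")
df_unique <- df[match(unique(key), key), ]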

Thanks!

- Mark

Mark Knecht

Jul 15, 2009, 10:22:04 PM
to Daniel Levine, Bay Area R Helpers
Seems to work fine, Daniel. As a test I just used the same system
twice. That means the intermediate df is twice as large as the
originals, and after removing common IDs it's back to the
original size.

Thanks for the ideas!

Cheers,
Mark


> System1 = read.csv("C:\\Track_Trades\\Klamath_Long_Track_Trades.csv",header=TRUE)
> System2 = read.csv("C:\\Track_Trades\\Klamath_Long_Track_Trades.csv",header=TRUE)
>
> dim(System1)
[1] 412 425
> dim(System2)
[1] 412 425
>
> SystemNew = rbind(System1, System2)
> dim(SystemNew)
[1] 824 425
> SystemNew$MyID = paste(SystemNew$EnDate,SystemNew$EnTime,SystemNew$ExDate,SystemNew$ExTime, sep=" ")
> dim(SystemNew)
[1] 824 426
>
> head(SystemNew$MyID)
[1] "1040107 915 1040107 1300"  "1040108 909 1040108 1300"  "1040115 921 1040115 1300"
[4] "1040120 1134 1040120 1300" "1040121 923 1040121 1300"  "1040205 1043 1040205 1300"
>
> result = SystemNew[match(unique(SystemNew$MyID),SystemNew$MyID),]
>
> dim(result)
[1] 412 426