[R] adding rows without loops

Adeel Amin

unread,

May 23, 2013, 1:00:50 AM5/23/13

to r-h...@r-project.org

I'm comparing a variety of datasets with over 4M rows. I've solved this
problem 5 different ways using a for/while loop but the processing time is
murder (over 8 hours doing this row by row per data set). As such I'm
trying to find whether this solution is possible without a loop or one in
which the processing time is much faster.

Each dataset is a time series as such:

DF1:

X.DATE X.TIME VALUE VALUE2
1 01052007 0200 37 29
2 01052007 0300 42 24
3 01052007 0400 45 28
4 01052007 0500 45 27
5 01052007 0700 45 35
6 01052007 0800 42 32
7 01052007 0900 45 32
.
.
.
n

DF2

X.DATE X.TIME VALUE VALUE2
1 01052007 0200 37 29
2 01052007 0300 42 24
3 01052007 0400 45 28
4 01052007 0500 45 27
5 01052007 0600 45 35
6 01052007 0700 42 32
7 01052007 0800 45 32

.
.
n+4000

In other words there are 4000 more rows in DF2 then DF1 thus the datasets
are of unequal length.

I'm trying to ensure that all dataframes have the same number of X.DATE and
X.TIME entries. Where they are missing, I'd like to insert a new row.

In the above example, when comparing DF2 to DF1, entry 01052007 0600 entry
is missing in DF1. The solution would add a row to DF1 at the appropriate
index.

so new dataframe would be

X.DATE X.TIME VALUE VALUE2
1 01052007 0200 37 29
2 01052007 0300 42 24
3 01052007 0400 45 28
4 01052007 0500 45 27
5 01052007 0600 45 27
6 01052007 0700 45 35
7 01052007 0800 42 32
8 01052007 0900 45 32

Value and Value2 would be the same as row 4.

Of course this is simple to accomplish using a row by row analysis but with
of 4M rows the processing time destroying and rebinding the datasets is
very time consuming and I believe highly un-R'ish. What am I missing?

Thanks!

[[alternative HTML version deleted]]

______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Blaser Nello

unread,

May 23, 2013, 3:14:53 AM5/23/13

to Adeel Amin, r-h...@r-project.org

Merge should do the trick. How to best use it will depend on what you
want to do with the data after.
The following is an example of what you could do. This will perform
best, if the rows are missing at random and do not cluster.

DF1 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:5,7:9)*100,
VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))
DF2 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:8)*100,
VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))

DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)

while(any(is.na(DFm))){
if (any(is.na(DFm[1,]))) stop("Complete first row required!")
ind <- which(is.na(DFm), arr.ind=TRUE)
prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
DFm[is.na(DFm)] <- DFm[prind]
}
DFm

Best,
Nello

-----Original Message-----
From: r-help-...@r-project.org [mailto:r-help-...@r-project.org]
On Behalf Of Adeel Amin
Sent: Donnerstag, 23. Mai 2013 07:01
To: r-h...@r-project.org
Subject: [R] adding rows without loops

I'm comparing a variety of datasets with over 4M rows. I've solved this
problem 5 different ways using a for/while loop but the processing time
is murder (over 8 hours doing this row by row per data set). As such
I'm trying to find whether this solution is possible without a loop or
one in which the processing time is much faster.

Each dataset is a time series as such:

DF1:

X.DATE X.TIME VALUE VALUE2
1 01052007 0200 37 29
2 01052007 0300 42 24
3 01052007 0400 45 28
4 01052007 0500 45 27
5 01052007 0700 45 35
6 01052007 0800 42 32
7 01052007 0900 45 32
...

...
...

n

DF2

X.DATE X.TIME VALUE VALUE2
1 01052007 0200 37 29
2 01052007 0300 42 24
3 01052007 0400 45 28
4 01052007 0500 45 27
5 01052007 0600 45 35
6 01052007 0700 42 32
7 01052007 0800 45 32

...
...

Adeel - SafeGreenCapital

unread,

May 23, 2013, 8:54:26 AM5/23/13

to Blaser Nello, r-h...@r-project.org

Thank you Blaser:

This is the exact solution I came up with but when comparing 8M rows even on
an 8G machine, one runs out of memory. To run this effectively, I have to
break the DF into smaller DFs, loop through them and then do a massive
rmerge at the end. That's what takes 8+ hours to compute.

Even the bigmemory package is causing OOM issues.

Rainer Schuermann

unread,

May 23, 2013, 10:07:00 AM5/23/13

to r-h...@r-project.org

Using the data generated with your code below, does

rbind( DF1, DF2[ !(DF2$X.TIME %in% DF1$X.TIME), ] )
DF1 <- DF1[ order( DF1$X.DATE, DF1$X.TIME ), ]

do the job?

Rgds,
Rainer

William Dunlap

unread,

May 23, 2013, 11:58:32 AM5/23/13

to Adeel - SafeGreenCapital, Blaser Nello, r-h...@r-project.org

> This is the exact solution I came up with ...

"exact", really?

Is the time-consuming part the initial merge

DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)

or the postprocessing to turn runs of NAs into the last non-NA
value in the column

while(any(is.na(DFm))){
if (any(is.na(DFm[1,]))) stop("Complete first row required!")
ind <- which(is.na(DFm), arr.ind=TRUE)
prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
DFm[is.na(DFm)] <- DFm[prind]
}

If it is the latter, you may get better results from applying zoo::na.locf()
to each non-key column of DFm. E.g.,
library(zoo)
f2 <- function(DFm) {
for(i in 3:length(DFm)) {
DFm[[i]] <- na.locf(DFm[[i]])
}
DFm
}
f(DFm)
gives the same result as Blaser's algorithm
f1 <- function (DFm) {
while (any(is.na(DFm))) {

if (any(is.na(DFm[1, ])))
stop("Complete first row required!")
ind <- which(is.na(DFm), arr.ind = TRUE)
prind <- matrix(c(ind[, "row"] - 1, ind[, "col"]), ncol = 2)
DFm[is.na(DFm)] <- DFm[prind]
}
DFm
}

If there are not a huge number of columns I would guess that f2() would be much
faster.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

Adeel Amin

unread,

May 24, 2013, 12:39:11 AM5/24/13

to r-h...@r-project.org

Rainer...I can't believe this did the trick. You're a genius. Thank you
sir.

Reply all

Reply to author

Forward