Intelligent processor allocation with "embarrassingly parallel" linear regression loop


Derek Nixon

Sep 25, 2015, 12:13:27 PM
to Davis R Users' Group

Hi all,

 

I am running a set of regressions, one per U.S. state, none of which need to interact with the others at all (i.e. "embarrassingly parallel"), and the loop has an inefficiently long runtime that I'm hoping someone can help me with. Some regressions in the loop take significantly longer than others, depending on how many observations they use (the number of rows in the dataframe per state can differ by a factor of 10). While the parallel loop runs, the Activity Monitor on my machine (OS X Yosemite) shows three processors running far longer than the other five, and one running even longer than the other seven. I know in advance which regressions will take the longest, but I haven't figured out how to communicate this to R, so R isn't starting the longest regressions (most observations) right away on dedicated processors while the other processors work through several smaller ones. Right now I'm running 12 state regressions on 8 cores, which takes 1.8 to 2.2 hours, but I will soon be running 48 state regressions with a more complex standard-error correction (i.e. a longer runtime per regression), so the problem could get significantly worse. This will likely also apply to my future research.
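(For reference, the per-state spread is easy to see with something like this:)

#####
# rows per state, smallest to largest; the largest states have roughly
# 10x as many rows as the smallest
sort(table(rdata2$state))
#####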

 

My code uses the "parallel", "doMC", and "foreach" packages for the parallel loop, and the "lmtest" and "sandwich" packages for the regression analysis (there is serious autocorrelation, so I'm applying a Newey-West correction with many lags). My code looks like this:

 

#####

# packages for the parallel loop and the regression analysis:
library(parallel)
library(foreach)
library(doMC)      # parallel backend for foreach
library(lmtest)    # coeftest()
library(sandwich)  # NeweyWest()

# preparing for looping:
registerDoMC(8)      # I have an 8 core computer
getDoParRegistered() # confirms a parallel backend is registered

statecodes <- sort(unique(rdata2$state)) # regressions by state of U.S.; dataframe is rdata2
codes <- numeric(length(statecodes))     # I'm guessing this is the step to improve

# measurement of loop and recording of output:
ptm <- proc.time()
sink("output.txt") # I just need a text file of the output

foreach (i = seq_along(statecodes)) %dopar% {
  # specification omitted; gm is an arbitrary name
  gm <- lm(arithmetic_mean ~ … , data = rdata2, subset = rdata2$state == statecodes[i])
  # coeftest comes from the lmtest and sandwich packages;
  # lag 15 is the cause of the lengthy runtime, but I will expand this further
  coef <- coeftest(gm, df = Inf, vcov = NeweyWest(gm, lag = 15, prewhite = FALSE))
  name <- cbind(as.data.frame.matrix(coef), statecodes[i])
  print(name)
}

sink() # shutting down "output.txt" recording
proc.time() - ptm # generally around 2 hours

#####


I'm new to R, so apologies if the answer exists elsewhere on the web; I've seen intelligent processor allocation discussed in general terms, but I can't find actual code I can use. I'm using RStudio version 0.98.1103 on OS X Yosemite (Macintosh; Intel Mac OS X 10_10_5), according to the "About RStudio" page.


Best!


Derek Nixon

Ag and Resource Economics

Jaime Ashander

Sep 25, 2015, 12:45:25 PM
to davi...@googlegroups.com
Try ordering the vector `statecodes` by the number of observations, in decreasing order. Then (I think) foreach will try to start the models with the most observations earlier — see the sketch below.
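A minimal sketch of that idea (untested, and assuming your dataframe is still `rdata2` as in your code):

#####
# count rows per state, then sort the states largest-first so the
# longest-running regressions are handed to workers earliest
obs_per_state <- sort(table(rdata2$state), decreasing = TRUE)
statecodes <- names(obs_per_state)
#####

If the backend is still handing states to workers in fixed, pre-assigned chunks, it may also help to ask for dynamic dispatch; I believe doMC accepts a multicore option for this in the `foreach()` call, e.g. `foreach(i = seq_along(statecodes), .options.multicore = list(preschedule = FALSE))`, but check the doMC vignette to confirm.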

 
