randomForest in RHadoop


Anuja Ranjan

Sep 26, 2013, 9:18:31 AM
to rha...@googlegroups.com

Hey,
I am using a master node with 6 slave nodes to run a randomForest MapReduce task on a dataset of nearly 2.7 million entries. I randomly take a chunk of 2% of the set and train a randomForest on it; this cycle is repeated 50 times. However, when I take 5% of the set, my map tasks start getting killed and ultimately the whole job is killed too. There is no memory constraint: plenty of memory is still free while the process executes.
Any help on this?
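For context, a minimal sketch of the sampling step described above, assuming the rmr2 and randomForest packages; the input path and the response column name ("label") are placeholders, not the actual ones:

```r
# Hypothetical sketch (untested): each map task samples ~2% of its
# rows and fits one forest; assumes rmr2 and randomForest are installed.
library(rmr2)
library(randomForest)

fit.chunk = function(k, v) {
  # v is a data frame of input rows; draw a 2% random sample
  idx = sample(nrow(v), ceiling(0.02 * nrow(v)))
  chunk = v[idx, ]
  # "label" stands in for the actual response column
  fit = randomForest(label ~ ., data = chunk, ntree = 50)
  keyval(1, list(fit))
}

forests = values(from.dfs(
  mapreduce(input = "/path/to/entries", map = fit.chunk)))
# the resulting forests could then be merged with randomForest::combine()
```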
Anuja

Antonio Piccolboni

Sep 26, 2013, 11:08:53 AM
to RHadoop Google Group
It could be a timeout. You need to dig into the logs to figure out exactly what it is. You could also put some debugging calls in your program to see where it fails. If you suspect that a call xyz() is causing the problem, you could add:

rmr.str("about to call xyz")
xyz()
rmr.str("done with xyz")

Then you can examine the stderr logs to see whether the failure happens before or after xyz().
To avoid timeouts, you just need to call status("some message") every few seconds. If the timeout happens in a library call you can't modify, and that call has an option to print a progress bar or the like, you can turn that option on while surrounding the call with sink() as follows:

sink(stderr())   # divert printed output to stderr, which Hadoop captures
long_library_call(...., verbose = TRUE)
sink(NULL)       # restore normal output
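The status() heartbeat mentioned above might look like the following sketch; the loop bound and the work inside it are placeholders for whatever the map function actually does:

```r
# Sketch: report progress from inside a long-running map function so
# the task tracker does not kill the task as unresponsive.
map.fun = function(k, v) {
  n.iterations = 50  # placeholder for the real iteration count
  for (i in 1:n.iterations) {
    status(paste("iteration", i, "of", n.iterations))
    # ... expensive work on this chunk here ...
  }
  keyval(k, v)
}
```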

I hope this helps.

Antonio


--
post: rha...@googlegroups.com ||
unsubscribe: rhadoop+u...@googlegroups.com ||
web: https://groups.google.com/d/forum/rhadoop?hl=en-US
