How does the JobTracker invoke the R client on the TaskTracker?


Dhanasekaran Anbalagan

Mar 15, 2013, 10:06:05 AM
to rha...@googlegroups.com
Hi Guys,

I want to know how the JobTracker actually invokes the R client on the TaskTracker. Can I pass some extra arguments to the process the TaskTracker invokes?

--min-vsize=N, --min-nsize=N
For expert use only: set the initial trigger sizes for garbage collection of the vector heap (in bytes) and cons cells (number) respectively. Suffix ‘M’ specifies megabytes or millions of cells respectively. The defaults are 6Mb and 350k respectively.

--max-ppsize=N
Specify the maximum size of the pointer protection stack as N locations. This defaults to 10000, but can be increased to allow large and complicated calculations to be done. Currently the maximum value accepted is 100000.

--max-mem-size=N
(Windows only) Specify a limit for the amount of memory to be used both for R objects and working areas. This is set by default to the smaller of the amount of physical RAM in the machine and, for 32-bit R, 1.5Gb, and must be between 32Mb and the maximum allowed on that version of Windows.
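For reference, a quick way to check from inside R whether trigger settings like these took effect (a minimal sketch, not RHadoop-specific):

# gc() reports, for cons cells (Ncells) and the vector heap (Vcells),
# both current usage and the current GC trigger, so the "gc trigger"
# columns show the effect of --min-nsize / --min-vsize.
gc()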

Can you also explain how an RHadoop job is dispatched to the R client machines [presumably the TaskTrackers]?

Please guide me.

-Dhanasekaran

Antonio Piccolboni

Mar 15, 2013, 11:56:08 AM
to RHadoop Google Group
Hi, 
the implementation of the hadoop backend is based on hadoop streaming: Rscript, with a number of arguments, is supplied as the -mapper, -reducer and -combiner options (the latter two only when necessary) accepted by hadoop streaming. It is conceivable to provide additional arguments and options to Rscript, so if Rscript also accepts the options you are interested in setting, then it can be done, but not in the current version of rmr. Please enter an issue in the issue tracker and give a little bit of motivation or a use case for this request. Thanks
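For illustration only (the jar path and script names here are hypothetical, not rmr's actual internals), the kind of command the backend assembles looks roughly like this:

# Sketch of a hadoop streaming invocation as rmr might build it; the
# mapper and reducer are Rscript calls, which is where any extra
# Rscript options would have to be spliced in.
cmd <- paste(
  "hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
  "-input /tmp/rmr-input -output /tmp/rmr-output",
  "-mapper 'Rscript ./rmr-streaming-map'",
  "-reducer 'Rscript ./rmr-streaming-reduce'")
system(cmd)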


Antonio





Antonio Piccolboni

Mar 18, 2013, 11:59:30 AM
to rha...@googlegroups.com
Thanks for the additional information. I entered issue #27 to track development of this feature. I described it as a general ability to pass options to Rscript, which is what hadoop invokes on the task nodes. Could you please verify that Rscript accepts the options you need on the command line? Thanks
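A quick, purely illustrative way to run that check from R itself:

# system2() runs Rscript with the candidate options and a trivial
# expression; a zero exit status means Rscript accepted them.
status <- system2("Rscript",
                  c("--min-vsize=512M", "--min-nsize=500k", "-e", "gc()"))
status == 0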


Antonio

On Monday, March 18, 2013 2:53:11 AM UTC-7, Rajaneesh Madanagopal wrote:
Hi Antonio,
        Thanks for the response. Let me shed more light on the issue we are facing.

We have a 3-machine beta Hadoop cluster where we use RHadoop. The issue we are facing, on both the beta cluster and a larger 50-machine Hadoop cluster, is that with R streaming jobs the R code, which is compute- and memory-intensive, seems to spend its time garbage collecting. These are 32-core machines, and what we find is that system CPU% goes up to 70% whereas user CPU% stays low. When profiling with "perf", I noticed a lot of spinlock contention inside the kernel, which drives the system CPU% up massively, thereby rendering the machine less effective for other jobs.

On further debugging with "gdb", I found that the R program was the cause of the issue: it seems to be garbage collecting quite frequently. See the stack trace of one such instance below. We do know that the code does a lot of rbind [without proper data type mapping] and a lot of memory allocation in a tight loop, but before attempting to fix the code, I wanted to spawn the R program with "--min-vsize=512M --max-vsize=1G --min-nsize=500k --max-nsize=1M" and see whether that reduces the system CPU%. So is there a way to set this currently? Let me know if you need more input.
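As an aside, a minimal sketch of that rbind pattern and the usual fix (the workload stand-ins n and f are hypothetical):

# Stand-ins for the real workload, for illustration only.
n <- 10000
f <- function(i) data.frame(id = i, x = rnorm(1))

# Growing a result with rbind() copies the accumulated object on every
# iteration, so allocation (and hence GC work) grows quadratically.
out <- NULL
for (i in seq_len(n)) out <- rbind(out, f(i))

# Collecting the pieces first and combining once at the end replaces
# n incremental copies with one large allocation.
pieces <- vector("list", n)
for (i in seq_len(n)) pieces[[i]] <- f(i)
out <- do.call(rbind, pieces)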

Thanks
Raj Madanagopal 
Deep Value



[Thread 0x7f18f05f7700 (LWP 25405) exited]

Breakpoint 1, RunGenCollect (size_needed=353916) at memory.c:1510
1510    PROCESS_NODES();
(gdb) where
#0  RunGenCollect (size_needed=353916) at memory.c:1510
#1  R_gc_internal (size_needed=353916) at memory.c:2579
#2  0x000000000042248b in Rf_allocVector (type=14, length=353916) at memory.c:2353
#3  0x00007f18f6f67c63 in extract_col (x=0x7f18d3532010, j=0x70bfe60, drop=0x4795668, first_=<optimized out>, last_=0x78d9c98) at extract_col.c:31
#4  0x000000000053edfd in do_dotcall (call=0x4694f40, op=<optimized out>, args=<optimized out>, env=<optimized out>) at dotcode.c:549
#5  0x0000000000573cac in Rf_eval (e=0x4694f40, rho=0x78ca848) at eval.c:492
#6  0x00000000005768d4 in do_return (call=0x4694ed0, op=<optimized out>, args=<optimized out>, rho=0x78ca848) at eval.c:1430
#7  0x0000000000573b29 in Rf_eval (e=0x4694ed0, rho=0x78ca848) at eval.c:466
#8  0x0000000000573b29 in Rf_eval (e=0x4694db8, rho=0x78ca848) at eval.c:466
#9  0x0000000000576960 in do_begin (call=0x46a58a0, op=0x1d53940, args=0x4694d80, rho=0x78ca848) at eval.c:1413
#10 0x0000000000573b29 in Rf_eval (e=0x46a58a0, rho=0x78ca848) at eval.c:466
#11 0x0000000000577b7f in Rf_applyClosure (call=0x78cad18, op=0x46a6290, arglist=<optimized out>, rho=0x78cadc0, suppliedenv=0x78cad88) at eval.c:859
#12 0x000000000042dd42 in Rf_usemethod (generic=0x696699 "[", obj=0x7fff87117a30, call=<optimized out>, args=<optimized out>, rho=0x78cadc0, callrho=0x78caf48, defrho=0x1d7aa98, ans=0x7fff87117e78)
    at objects.c:363


# Another instance

0x00000000004205fb in RunGenCollect (size_needed=58798) at memory.c:1510
1510    PROCESS_NODES();

Antonio Piccolboni

Mar 18, 2013, 3:33:59 PM
to rha...@googlegroups.com
The change is in the dev branch and you can test it by building from there. Two of the options you want to specify are no longer supported by R, so there is nothing we can do about those. This change is targeted for 2.2.0.

Rajaneesh Madanagopal

Mar 26, 2013, 5:16:22 AM
to rha...@googlegroups.com
Thanks Antonio. 

The development version of R seems to have brought these options back. I will give it a shot for the rest of the options.

Antonio Piccolboni

Apr 16, 2013, 1:50:39 PM
to RHadoop Google Group
It looks like min vsize and min nsize can be controlled by environment variables as well. It seems to me we don't have a good case for this feature and I was a little too fast to add it. I think I am going to pull it before release, because once you expand the API you have to support it, and it takes a major release to undo the change. For now you can use something like (with R 3.0.0):

mapreduce(to.dfs(1:10), backend.parameters = list(hadoop = list(cmdenv = "R_NSIZE")))

Please build directly from master or wait for 2.2.0; there was a subtle bug in the ordering of options that made the above command fail in 2.1.
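A slightly fuller sketch of the same idea (the value below is made up for illustration; hadoop streaming's -cmdenv expects NAME=value pairs):

library(rmr2)  # rmr 2.2 / current master, per the above

mapreduce(
  to.dfs(1:10),
  map = function(k, v) keyval(k, v),
  backend.parameters = list(
    # R_NSIZE sets the initial cons-cell GC trigger of the task-side R
    # processes (see ?Memory); R_VSIZE works the same way for the
    # vector heap.
    hadoop = list(cmdenv = "R_NSIZE=500000")))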

Let me know if it works for you.


Antonio

Antonio Piccolboni

Apr 16, 2013, 2:00:17 PM
to rha...@googlegroups.com
Let me clarify: for the above to work you need rmr 2.2 (or current master; just give me ten minutes to finish off some changes) and R 3.0.0. It runs, but I don't see any effect with R 2.15, and I don't have R 3.0 installed, so you are going to be the trailblazer, I suspect.


Antonio