Can I get access to the details of the counters produced by increment.counter

FEI

Mar 18, 2015, 11:21:00 AM
to rha...@googlegroups.com

Hi All,

I find the increment.counter function really useful when trying to get some basic descriptive results. I was wondering whether the counter values are stored somewhere while the job runs, so that I can access them for further analysis.

Thanks,
Fei

Antonio Piccolboni

Mar 18, 2015, 11:24:32 AM
to rha...@googlegroups.com
They are, but the relevant API is not available in rmr2. Agreed, it would be nice to have, but because of the way rmr2 communicates with Hadoop (Hadoop Streaming) it's not something we can easily add.

Saar Golde

Mar 19, 2015, 8:39:40 PM
to rha...@googlegroups.com
I've been interested in this for a while, and I think I've made some progress. It's not a full solution, but maybe a step in the right direction that (with some more work) can become one:

1. Once a job is over, the command line 'mapred job -status [[job_id]]' yields the full status message for a specific job: the job file, the tracking URL, and all the counters. I'm not sure how standard the built-in counters are, but if they are, anything outside the standard set is probably a custom counter. Alternatively, knowing which counter names to look for may be useful for some regex matches.

2. Looks like the job id is part of the object returned by the mapreduce function when in non-verbose mode. So after some additional digging and some trial and error, this seems to be working fine:

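library(rmr2)

# Write a small test vector to the DFS
myData <- to.dfs(1:1000)

# Run in non-verbose mode, where the job id is attached to the result
myOut2 <- mapreduce(
  input = myData,
  map = function(k, v) keyval(key = v, val = v^2),
  verbose = FALSE)

# The job id is stored as an attribute of the returned object
jobId <- attr(x = myOut2, which = "job.id")

# Print the full job status, counters included
system(paste(rmr2:::hadoop.cmd(), "job -status", jobId))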

It may require some further string manipulation, but at the very least the counters (and other outputs) can be made available as part of something we can use programmatically... 
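As a rough illustration of that string manipulation, here is a minimal sketch that pulls name/value pairs out of the status text. It assumes the counters appear one per line in the form "Name=Value", which is how I understand the output to be laid out but may differ between Hadoop versions. The function name parse_counters and the sample lines are made up for illustration and are not real job output.

```r
# Hypothetical helper: extract "Name=Value" counter lines from job status text.
# Assumes each counter is on its own line and the value is an integer.
parse_counters <- function(status_lines) {
  pattern <- "^[[:space:]]*(.+)=(-?[0-9]+)[[:space:]]*$"
  counter_lines <- status_lines[grepl(pattern, status_lines)]
  data.frame(
    name  = trimws(sub(pattern, "\\1", counter_lines)),
    # numeric rather than integer, since counters can exceed the integer range
    value = as.numeric(sub(pattern, "\\2", counter_lines)),
    stringsAsFactors = FALSE
  )
}

# Made-up sample in the assumed layout (not copied from a real job)
sample_status <- c(
  "Job: job_0000000000000_0001",
  "Job state: SUCCEEDED",
  "Counters: 2",
  "\tJob Counters",
  "\t\tLaunched map tasks=2",
  "\tMap-Reduce Framework",
  "\t\tMap input records=1000"
)

counters <- parse_counters(sample_status)
print(counters)
```

To feed it real output, the system() call above can be given intern = TRUE so that it returns the printed lines as a character vector instead of only echoing them. Custom counters could then be picked out by name from the resulting data frame.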

I'm not sure how feasible it would be to make the job ID available as an attribute of the output in verbose mode, but I think it would definitely be useful. 

Hope this helps, 
-Saar





Antonio Piccolboni

Mar 20, 2015, 7:55:37 PM
to RHadoop Google Group
This looks like it could be made into a new feature, thanks Saar!

FEI

Mar 22, 2015, 9:57:45 PM
to rha...@googlegroups.com, ant...@piccolboni.info

Thank you all for the inputs!

Best,
Fei