Setting mapper count / split count via mapreduce.job.maps


Marek Bejda

Mar 26, 2015, 12:00:38 PM
to rh...@googlegroups.com
Hello All! 

   I finally got RHIPE working with version 0.75 and started initial benchmarking on my datasets. I am able to set the number of reducers using mapreduce.job.reduces=5
   in

mapred = list(
             mapred.task.timeout = 0
           , mapreduce.job.maps = num_mappers      # CDH5
           , mapreduce.job.reduces = num_reducers  # CDH5
         )

rhipe.results <- rhwatch(
                        map = mapper, reduce = reducer,
                        input = rhfmt(input.file.hdfs, type = "text"),
                        output = output.dir.hdfs,
                        jobname = paste("rhipe", num_mappers, num_reducers, input.file.name, sep = "-"),
                        mapred = mapred)

However, setting mapreduce.job.maps has no effect on how the file is split: the job always launches the default number of mappers, i.e. one per HDFS block.

The wordcount script I am running can be found here:
https://github.com/marek5050/Sparkie/blob/master/RHIPE/wordcount.sh
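
For context, a minimal RHIPE wordcount pair might look like the following. This is a sketch only, not the script at the link; it assumes RHIPE's usual map.values / reduce.key / reduce.values variables and the rhcollect() emitter:

mapper <- expression({
    # map.values holds the lines of the current input split
    lapply(map.values, function(line) {
        for (w in unlist(strsplit(line, "[[:space:]]+")))
            if (nchar(w) > 0) rhcollect(w, 1L)   # emit (word, 1)
    })
})

reducer <- expression(
    pre    = { total <- 0L },                                  # runs once per key
    reduce = { total <- total + sum(unlist(reduce.values)) },  # per batch of values
    post   = { rhcollect(reduce.key, total) }                  # emit (word, count)
)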


Saptarshi Guha

Mar 26, 2015, 12:03:11 PM
to rh...@googlegroups.com
That is because RHIPE uses the standard Hadoop class for reading text
files, which splits the input according to the number of blocks. You can
increase the number of maps by lowering the maximum split size for the
job, e.g.

mapred.max.split.size = as.integer(1024*1024*64)
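
So, adapting the mapred list from the first message, something like the following should give roughly one mapper per 64 MB of input, regardless of the HDFS block size (a sketch; num_reducers, mapper, reducer, and the paths are as in the original snippet):

mapred = list(
             mapred.task.timeout = 0
           , mapreduce.job.reduces = num_reducers               # CDH5
           , mapred.max.split.size = as.integer(1024*1024*64)   # 64 MB per input split
         )

rhipe.results <- rhwatch(
                        map = mapper, reduce = reducer,
                        input = rhfmt(input.file.hdfs, type = "text"),
                        output = output.dir.hdfs,
                        mapred = mapred)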

Marek Bejda

Mar 29, 2015, 1:46:03 PM
to rh...@googlegroups.com, saptars...@gmail.com
Awesome, it worked! Thank you! So RHIPE doesn't actually use Hadoop Streaming? Would you know where I can find some information about how RHIPE actually works, besides the source code? :)

Saptarshi Guha

Mar 30, 2015, 1:05:02 PM
to Marek Bejda, rh...@googlegroups.com
No, it doesn't use Streaming, though it does use the same principle:
write from Hadoop to C via stdout, and back via stdin, with
errors/messages going through stderr.
As for how it works, feel free to ask questions here rather than
digging through the source. We can provide detailed explanations.
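
To illustrate the principle (a toy plain-text sketch only; RHIPE itself serializes R objects in its own binary format, so this shows the Streaming-style idea rather than RHIPE's actual wire format):

# Read records from stdin, write key-value pairs to stdout,
# and send diagnostics to stderr: the same three-channel contract.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
    for (w in unlist(strsplit(line, "[[:space:]]+")))
        if (nchar(w) > 0) cat(sprintf("%s\t1\n", w))   # stdout
}
message("mapper finished")                             # stderr
close(con)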