rhipe_map_buffer_size


Xiaosu Tong

Mar 2, 2016, 11:57:35 PM
to rhipe
Hi

Recently I was running a series of simulations to understand the rhipe_map_buffer_size parameter.
I first used a rhipe job to simulate 1 GB of data on HDFS (using the rnorm() function), varying the block
size and the value size of each key-value pair saved on HDFS. Then I used a second
rhipe job to read the data back and compute the length of the map.values list. I set rhipe_map_buffer_size
to 10000. Here is what I got:

BLK: block size in MB
KV: value size of each key-value pair, in log2 MB
mode: the mode of the length of map.values
maximum: the maximum length of map.values
mem: memory size in MB of a map.values list of maximum length

    BLK KV mode maximum mem
1   128  0  133     134 134
2   128  1   67      67 134
3   128  2   33      34 136
4   128  3   17      17 136
5   128  4    8       9 144
6   128  5    4       5 160
7   128  6    2       3 192
8   128  7    1       2 256
9   256  0  150     150 150
10  256  1   75      75 150
11  256  2   38      38 152
12  256  3   19      19 152
13  256  4   10      10 160
14  256  5    5       5 160
15  256  6    3       3 192
16  256  7    2       2 256
17 1024  0  150     150 150
18 1024  1   75      75 150
19 1024  2   38      38 152
20 1024  3   19      19 152
21 1024  4   10      10 160
22 1024  5    5       5 160
23 1024  6    3       3 192
24 1024  7    2       2 256

Apparently, rhipe_map_buffer_size has a cap that is different from the 10000 I set. Does anyone know
how this cap is decided?
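
For reference, here is roughly how the second job was set up (a simplified sketch assuming the usual rhwatch()/rhfmt()/rhcollect() calls; the HDFS paths are placeholders):

library(Rhipe)
rhinit()

# second job: for every map call, record how many key-value pairs
# were buffered into map.values
z <- rhwatch(
  map = expression({
    # map.values is the list of values handed to this map call;
    # emit its length so the mode and maximum can be tabulated afterwards
    rhcollect(length(map.values), 1L)
  }),
  input  = rhfmt("/tmp/sim/data1GB", type = "sequence"),  # placeholder path
  output = rhfmt("/tmp/sim/buflen",  type = "sequence"),
  mapred = list(rhipe_map_buffer_size = 10000,  # what I set
                mapred.reduce.tasks   = 0)      # map-only
)

(The output can then be read back with rhread() and tabulated to get the mode and maximum.)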

Thanks

Xiaosu

Saptarshi Guha

Mar 3, 2016, 12:23:18 AM
to rh...@googlegroups.com
1. It ought to be rhipe_map_buff_size, which is the number of records to read in (default 10000).
2. This is balanced by rhipe_map_bytes_read, which defaults to 150 MB.
3. In https://github.com/saptarshiguha/RHIPE/blob/master/src/main/C/mapreduce.cc#L289, if the amount of data read is > rhipe_map_bytes_read, the mapper is called with map_keys (and map_values) of a shorter length than rhipe_map_buff_size.

Also keep in mind that your memory size is different from RHIPE's internal rhipe_map_bytes count (which uses the protobuf encoding of the objects to compute the bytes read).
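
As a back-of-the-envelope check (just arithmetic, assuming each value's encoded size is close to its nominal 2^KV MB), the batch length should be roughly min(rhipe_map_buff_size, ceiling(rhipe_map_bytes_read / value size)):

buff_size  <- 10000
bytes_read <- 150        # default byte limit, in MB
kv         <- 2^(0:7)    # value size per record, in MB
# the record that pushes the running total past bytes_read is still included,
# hence the ceiling
pmin(buff_size, ceiling(bytes_read / kv))
## 150 75 38 19 10  5  3  2

which matches the mode/maximum in your 256 and 1024 MB rows; the 128 MB rows are capped earlier by the block boundary.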

HTH


Saptarshi Guha

Mar 3, 2016, 12:14:12 PM
to Xiaosu Tong, rhipe
IIRC rhipe_reduce_buff_size also has a default, but it is set in the R code.
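
In case it is useful, these knobs are all passed through the mapred list given to rhwatch(); a sketch (I have not checked these names against every RHIPE version):

mapred = list(
  rhipe_map_buff_size    = 10000,  # max records handed to one map call
  rhipe_reduce_buff_size = 100     # max records in one reduce.values batch (the R-side default)
)
# rhipe_map_bytes_read and rhipe_reduce_bytes_read can be set the same way;
# check the C++ source for the units they expect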

On Thu, Mar 3, 2016 at 8:40 AM, Xiaosu Tong <xiaos...@gmail.com> wrote:
Hi Saptarshi

Thanks for the info. I then found rhipe_reduce_bytes_read in

https://github.com/saptarshiguha/RHIPE/blob/master/src/main/C/reducer.cc

but with no default. So I am assuming the length of reduce.values is fully controlled by
rhipe_reduce_buff_size, whose default is 100. Am I right?

Thanks

Xiaosu

Saptarshi Guha

Mar 3, 2016, 12:24:59 PM
to Xiaosu Tong, rhipe
That and the buffer size. It will read as much as possible until one of those limits is reached: either too many elements or too many bytes of elements.
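
Conceptually the reading loop looks like this (just the idea, in R pseudocode; not the actual C++ in mapreduce.cc / reducer.cc):

# conceptual sketch only; read_next() and serialized_size are hypothetical
fill_buffer <- function(read_next, buff_size, bytes_limit) {
  buf   <- list()
  bytes <- 0
  while (length(buf) < buff_size && bytes < bytes_limit) {
    rec <- read_next()
    if (is.null(rec)) break               # no more records in this input split
    buf[[length(buf) + 1L]] <- rec
    bytes <- bytes + rec$serialized_size  # protobuf-encoded size, not R memory size
  }
  buf  # handed to user code as map.values / reduce.values
}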

On Thu, Mar 3, 2016 at 9:23 AM, Xiaosu Tong <xiaos...@gmail.com> wrote:
Thanks. So rhipe_map/reduce_bytes_read actually controls how much of the
input data is loaded into memory at one time, right?

Thanks

Xiaosu

Saptarshi Guha

Mar 8, 2016, 5:26:44 PM
to Xiaosu Tong, rhipe
Well, IIRC each block is handled by its own Java process, and it won't send more data than is contained inside a block.

But yes, it makes sense that rhipe_map_bytes_read < block size.
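
Put differently, a rough model of the records per map call (ignoring the exact serialized sizes) is:

# rough model, all sizes in MB; batch_len is a hypothetical helper
batch_len <- function(buff_size, bytes_read, block, kv)
  min(buff_size, ceiling(min(bytes_read, block) / kv))

batch_len(10000, 150, 1024, 1)  # 150 -> matches your BLK = 1024, KV = 0 row
batch_len(10000, 150,  128, 1)  # 128 -> close to the 133-134 you observed; the
                                # difference presumably comes from the serialized
                                # record size not being exactly a nominal 1 MB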

Cheers
Saptarshi

On Mon, Mar 7, 2016 at 12:01 PM, Xiaosu Tong <xiaos...@gmail.com> wrote:
Also I noticed from the experiment that rhipe_map_bytes_read should be set to less than the block size, or the block size should be set larger
than rhipe_map_bytes_read. For example, if the block size is 128 MB, rhipe_map_bytes_read is 150 MB, and each key-value pair is 1 MB, then a mapper
is handling more data than one block, so some records have to be copied to the mapper, which is bad.

Did I understand this correctly?

Thanks

Xiaosu
