rhipe_map_buffer_size


Xiaosu Tong

Mar 2, 2016, 11:57:35 PM
to rhipe
Hi

Recently I was running a series of simulations to understand the rhipe_map_buffer_size parameter.
I first used a rhipe job to simulate 1 GB of data on HDFS (using the rnorm() function), varying the block
size and the value size of each key-value pair saved on HDFS. Then I used a second
rhipe job to read the data back and compute the length of the map.values list. I set rhipe_map_buffer_size
to 10000. Here is what I got:

BLK: block size in MB
KV: value size of each key-value pair, in log2 MB
mode: the mode of the length of map.values
maximum: the maximum length of map.values
mem: memory size in MB of a map.values list of maximum length

    BLK KV mode maximum mem
1   128  0  133     134 134
2   128  1   67      67 134
3   128  2   33      34 136
4   128  3   17      17 136
5   128  4    8       9 144
6   128  5    4       5 160
7   128  6    2       3 192
8   128  7    1       2 256
9   256  0  150     150 150
10  256  1   75      75 150
11  256  2   38      38 152
12  256  3   19      19 152
13  256  4   10      10 160
14  256  5    5       5 160
15  256  6    3       3 192
16  256  7    2       2 256
17 1024  0  150     150 150
18 1024  1   75      75 150
19 1024  2   38      38 152
20 1024  3   19      19 152
21 1024  4   10      10 160
22 1024  5    5       5 160
23 1024  6    3       3 192
24 1024  7    2       2 256

Apparently, rhipe_map_buffer_size has a cap that is different from the 10000 I set. Does anyone know
how this cap is decided?
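
For reference, here is roughly how the second job was set up (a simplified sketch assuming the usual rhwatch()/rhfmt()/rhcollect() calls; the HDFS paths are placeholders):

library(Rhipe)
rhinit()

# second job: for every map call, record how many key-value pairs
# were buffered into map.values
z <- rhwatch(
  map = expression({
    # map.values is the list of values handed to this map call;
    # emit its length so the mode and maximum can be tabulated afterwards
    rhcollect(length(map.values), 1L)
  }),
  input  = rhfmt("/tmp/sim/data1GB", type = "sequence"),  # placeholder path
  output = rhfmt("/tmp/sim/buflen",  type = "sequence"),
  mapred = list(rhipe_map_buffer_size = 10000,  # what I set
                mapred.reduce.tasks   = 0)      # map-only
)

(The output can then be read back with rhread() and tabulated to get the mode and maximum.)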

Thanks

Xiaosu

Saptarshi Guha

Mar 3, 2016, 12:23:18 AM
to rh...@googlegroups.com
1. It ought to be rhipe_map_buff_size, which is the number of records to read in (default 10000).
2. This is balanced by rhipe_map_bytes_read, which defaults to 150 MB.
3. In https://github.com/saptarshiguha/RHIPE/blob/master/src/main/C/mapreduce.cc#L289, if the amount of data read is > rhipe_map_bytes_read, the mapper is called with map_keys (and map_values) of a shorter length than rhipe_map_buff_size.

Also keep in mind that your memory size is different from RHIPE's internal rhipe_map_bytes count (which uses the protobuf encoding of the objects to compute the bytes read).
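
As a back-of-the-envelope check (just arithmetic, assuming each value's encoded size is close to its nominal 2^KV MB), the batch length should be roughly min(rhipe_map_buff_size, ceiling(rhipe_map_bytes_read / value size)):

buff_size  <- 10000
bytes_read <- 150        # default byte limit, in MB
kv         <- 2^(0:7)    # value size per record, in MB
# the record that pushes the running total past bytes_read is still included,
# hence the ceiling
pmin(buff_size, ceiling(bytes_read / kv))
## 150 75 38 19 10  5  3  2

which matches the mode/maximum in your 256 and 1024 MB rows; the 128 MB rows are capped earlier by the block boundary.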

HTH


Saptarshi Guha

Mar 3, 2016, 12:14:12 PM
to Xiaosu Tong, rhipe
IIRC rhipe_reduce_buff_size also has a default, but it is set in the R code.
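
In case it is useful, these knobs are all passed through the mapred list given to rhwatch(); a sketch (I have not checked these names against every RHIPE version):

mapred = list(
  rhipe_map_buff_size    = 10000,  # max records handed to one map call
  rhipe_reduce_buff_size = 100     # max records in one reduce.values batch (the R-side default)
)
# rhipe_map_bytes_read and rhipe_reduce_bytes_read can be set the same way;
# check the C++ source for the units they expect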

On Thu, Mar 3, 2016 at 8:40 AM, Xiaosu Tong <xiaos...@gmail.com> wrote:
Hi Saptarshi

Thanks for the info. I then found rhipe_reduce_bytes_read in

https://github.com/saptarshiguha/RHIPE/blob/master/src/main/C/reducer.cc

but with no default. So I am assuming the length of reduce.values is fully controlled by
rhipe_reduce_buff_size, whose default is 100. Am I right?

Thanks

Xiaosu

Saptarshi Guha

Mar 3, 2016, 12:24:59 PM
to Xiaosu Tong, rhipe
That and the buffer size. It will read as much as possible until one of those limits is reached: either too many elements or too many bytes of elements.
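
Conceptually the reading loop looks like this (just the idea, in R pseudocode; not the actual C++ in mapreduce.cc / reducer.cc):

# conceptual sketch only; read_next() and serialized_size are hypothetical
fill_buffer <- function(read_next, buff_size, bytes_limit) {
  buf   <- list()
  bytes <- 0
  while (length(buf) < buff_size && bytes < bytes_limit) {
    rec <- read_next()
    if (is.null(rec)) break               # no more records in this input split
    buf[[length(buf) + 1L]] <- rec
    bytes <- bytes + rec$serialized_size  # protobuf-encoded size, not R memory size
  }
  buf  # handed to user code as map.values / reduce.values
}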

On Thu, Mar 3, 2016 at 9:23 AM, Xiaosu Tong <xiaos...@gmail.com> wrote:
Thanks. So rhipe_map/reduce_bytes_read actually controls how much of the
input data is loaded into memory at one time, right?

Thanks

Xiaosu

Saptarshi Guha

Mar 8, 2016, 5:26:44 PM
to Xiaosu Tong, rhipe
Well, IIRC each block is handled by its own Java process, and it won't send more data than is contained inside a block.

But yes, it makes sense that rhipe_map_bytes_read < block size.
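
Put differently, a rough model of the records per map call (ignoring the exact serialized sizes) is:

# rough model, all sizes in MB; batch_len is a hypothetical helper
batch_len <- function(buff_size, bytes_read, block, kv)
  min(buff_size, ceiling(min(bytes_read, block) / kv))

batch_len(10000, 150, 1024, 1)  # 150 -> matches your BLK = 1024, KV = 0 row
batch_len(10000, 150,  128, 1)  # 128 -> close to the 133-134 you observed; the
                                # difference presumably comes from the serialized
                                # record size not being exactly a nominal 1 MB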

Cheers
Saptarshi

On Mon, Mar 7, 2016 at 12:01 PM, Xiaosu Tong <xiaos...@gmail.com> wrote:
Also I noticed from the experiment that rhipe_map_bytes_read should be set to less than the block size, or the block size should be set larger
than rhipe_map_bytes_read. For example, if the block size is 128 MB, rhipe_map_bytes_read is 150 MB, and each key-value pair is 1 MB, then a mapper
is handling more data than one block, so some records have to be copied to the mapper, which is bad.

Did I understand this correctly?

Thanks

Xiaosu
