Strange behavior of Seastar memory allocator


Michael Shulbaev

<chulupey@gmail.com>
Oct 24, 2018, 3:17:36 AM
to seastar-dev
While allocating large temporary_buffer objects, I noticed that Seastar does not let me use most of my memory.

In short, the free memory reported by seastar::memory::stats().free_memory() is roughly my total memory divided by the shard count, but the largest buffer I can allocate is roughly that value divided by the shard count again. With 16G of memory on my machine, I can allocate 4G with one shard (one core), 2G with two shards, 2G with three shards, and 1G with four shards.

Expected behavior: I can allocate a buffer about as large as the value reported by free_memory(), that is, about my total memory divided by the shard count.

Actual behavior: I can allocate a buffer of about my total memory divided by the shard count squared.

The test program can be found at https://github.com/chulup/ext_sort/blob/master/src/memory_test.cpp ; here are the results of runs with different shard counts:


$ ./mem_test -c1
WARN  2018-10-24 13:54:12,414 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
free_memory: 13.88 G, smp::count: 1
free_memory/smp::count : 13.88 G
Allocating buffers starting from size 0.50 G with increment 0.50 G
Got bad_alloc on size 4.50 G
Allocating buffers starting from size 4.50 G with decrement 0.06 G
Successfully allocated buffer of size 4.00 G
Maximum buffer size is 4.00 G

$ ./mem_test -c2
WARN  2018-10-24 13:54:14,988 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
free_memory: 6.94 G, smp::count: 2
free_memory/smp::count : 3.47 G
Allocating buffers starting from size 0.50 G with increment 0.50 G
Got bad_alloc on size 2.50 G
Allocating buffers starting from size 2.50 G with decrement 0.06 G
Successfully allocated buffer of size 2.00 G
Maximum buffer size is 2.00 G

$ ./mem_test -c3
WARN  2018-10-24 13:54:17,597 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
free_memory: 4.60 G, smp::count: 3
free_memory/smp::count : 1.53 G
Allocating buffers starting from size 0.50 G with increment 0.50 G
Got bad_alloc on size 2.50 G
Allocating buffers starting from size 2.50 G with decrement 0.06 G
Successfully allocated buffer of size 2.00 G
Maximum buffer size is 2.00 G

$ ./mem_test -c4
WARN  2018-10-24 13:54:19,522 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
free_memory: 3.47 G, smp::count: 4
free_memory/smp::count : 0.87 G
Allocating buffers starting from size 0.50 G with increment 0.50 G
Got bad_alloc on size 1.50 G
Allocating buffers starting from size 1.50 G with decrement 0.06 G
Successfully allocated buffer of size 1.00 G
Maximum buffer size is 1.00 G
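
The two-phase search visible in the logs above (grow in coarse steps until the first bad_alloc, then shrink in fine steps until an allocation succeeds) can be sketched roughly as follows. This is not the linked test program: probe_max_alloc, try_alloc and limit are hypothetical names, and try_alloc stands in for "allocate a buffer of this size, release it, report success".

```cpp
#include <cstdint>
#include <functional>

// Returns the largest size for which try_alloc succeeded during the
// fine-grained phase, or 0 if no size succeeded.
// try_alloc is a hypothetical callback: it attempts one allocation of the
// given size, releases it again, and reports whether it succeeded.
// limit caps the coarse phase so the search terminates even if every
// allocation succeeds.
std::uint64_t probe_max_alloc(const std::function<bool(std::uint64_t)>& try_alloc,
                              std::uint64_t step_up,
                              std::uint64_t step_down,
                              std::uint64_t limit) {
    // Phase 1: grow in coarse steps until the first failure (or the cap).
    std::uint64_t size = step_up;
    while (size <= limit && try_alloc(size)) {
        size += step_up;
    }
    // Phase 2: shrink from the failing size in fine steps until success.
    while (size > step_down) {
        size -= step_down;
        if (try_alloc(size)) {
            return size;
        }
    }
    return 0;
}
```

The result is only as precise as step_down, and it says nothing about why larger sizes fail; it just finds the boundary.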



Avi Kivity

<avi@scylladb.com>
Oct 24, 2018, 3:43:44 AM
to Michael Shulbaev, seastar-dev


Can you try with 8 and 16 shards? Might need --overprovisioned.



You can use the "scylla memory" gdb command from https://github.com/scylladb/scylla/blob/master/scylla-gdb.py (we should move the seastar stuff into seastar.git) to see how memory was fragmented.


In general, you should not rely on large allocations as they cannot work reliably in a long-running server, but it would be nice to allocate almost all of memory on startup. But I think you hit an edge case:


 - a 4GB shard actually has less than 4GB, because some reserve is left for the OS

 - some memory is allocated by Seastar, at lower addresses


So the memory map looks like this:


   [allocated slabs] [free slabs up to 1GB boundary] [1GB slab] [1GB slab] [free slabs up to 4GB-epsilon boundary]


Seastar uses buddy allocation (since 33d8f74fc83a12601618a4a9fa7ad1f3a9955c73); that means a 2GB allocation has to be 2GB aligned. There isn't such a slab in a 4GB-epsilon shard.


Try running with --memory 9G --smp 2 or --memory 13G --smp 3 and you should see 2GB allocations succeed.
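
The alignment argument above can be checked with a small standalone model. It is a sketch under stated assumptions, not Seastar's allocator: offsets are relative to the start of the shard's memory, that start is assumed to be aligned to a large power of two, the free region is treated as one contiguous range, and largest_aligned_block is a hypothetical helper name.

```cpp
#include <cstdint>

// Largest power-of-two block size S such that some naturally aligned block
// [k*S, (k+1)*S) lies entirely inside the free range [free_begin, free_end).
// Offsets are relative to the start of the shard's memory, which this model
// assumes is itself aligned to a large power of two.
// Assumes free_begin + size does not overflow 64 bits.
std::uint64_t largest_aligned_block(std::uint64_t free_begin,
                                    std::uint64_t free_end) {
    std::uint64_t best = 0;
    // size != 0 guards against the shift wrapping around to zero.
    for (std::uint64_t size = 1; size != 0 && size <= free_end; size <<= 1) {
        // First multiple of size at or above free_begin.
        std::uint64_t start = (free_begin + size - 1) / size * size;
        if (start + size <= free_end) {
            best = size;  // an aligned block of this size fits
        }
    }
    return best;
}
```

With a hypothetical 64 MiB in use at the bottom of the shard and the end of the shard set to the free_memory values reported in this thread (13.88 G, 6.94 G, 4.60 G, 3.47 G, 1.73 G, 0.86 G), this model gives 4 G, 2 G, 2 G, 1 G, 0.5 G and 0.25 G, the same maximum buffer sizes the test runs report.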




chulupey@gmail.com

<chulupey@gmail.com>
Oct 24, 2018, 5:02:47 AM
to seastar-dev
Thanks for the help!

Case closed: the memory allocator works as implemented, not as I assumed it would. More details are inline.

On Wednesday, October 24, 2018 at 2:43:44 PM UTC+7, Avi Kivity wrote:



Can you try with 8 and 16 shards? Might need --overprovisioned.


Sure

$ ./mem_test -c8 --overprovisioned
WARN  2018-10-24 15:07:44,272 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
free_memory: 1.73 G, smp::count: 8
free_memory/smp::count : 0.22 G

Allocating buffers starting from size 0.50 G with increment 0.50 G
Got bad_alloc on size 1.00 G
Allocating buffers starting from size 1.00 G with decrement 0.06 G
Successfully allocated buffer of size 0.50 G
Maximum buffer size is 0.50 G

$ ./mem_test -c16 --overprovisioned
WARN  2018-10-24 15:09:11,673 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
free_memory: 0.86 G, smp::count: 16
free_memory/smp::count : 0.05 G

Allocating buffers starting from size 0.50 G with increment 0.50 G
Got bad_alloc on size 0.50 G
Allocating buffers starting from size 0.50 G with decrement 0.06 G
Successfully allocated buffer of size 0.25 G
Maximum buffer size is 0.25 G

That explains the issue quite nicely. I changed the code to allocate and keep multiple buffers, and the results corroborate the explanation.

I think this should be mentioned somewhere in the docs, because it took me some time to convince myself that I can't take the "free_memory" statistic for granted and to start testing the maximum buffer size instead.