Using a SEDA-like architecture to reduce icache misses


Avi Kivity

Mar 9, 2017, 11:42:03 AM
to mechanical-sympathy
We noticed in ScyllaDB that performance suffers due to a high icache miss
rate; we saw very low IPC.


We implemented a SEDA-like mechanism to run tasks tied to the same
function sequentially; the first task warms up the icache and branch
predictors, and the second and subsequent runs benefit from this warmup.
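
To make the idea concrete, here is a toy sketch of the batching idea
(illustrative only; the real seastar version returns a future per call
and is scheduled by the reactor):

#include <functional>
#include <utility>
#include <vector>

// Toy illustration: calls to the same function are queued rather than
// run inline; the event loop later runs the whole batch back to back,
// so the first call warms the icache and branch predictors for the rest.
template <typename Arg>
class execution_stage {
    std::function<void(Arg)> _func;
    std::vector<Arg> _queue;
public:
    explicit execution_stage(std::function<void(Arg)> f) : _func(std::move(f)) {}

    // Buffer the argument instead of calling _func immediately.
    void operator()(Arg arg) { _queue.push_back(std::move(arg)); }

    // Run the accumulated batch through the same code path consecutively.
    void flush() {
        for (auto& arg : _queue) {
            _func(std::move(arg));
        }
        _queue.clear();
    }
};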


The implementation of the mechanism in seastar can be viewed here:
https://github.com/scylladb/seastar/commit/384c81ba7227a9a99d485d1bb68c98c5f3a6b209


Usage in ScyllaDB, with some performance numbers, can be viewed here:
https://github.com/scylladb/scylla/commit/efd96a448cca4499fd40df8b3df3f0f8444a1464.
Microbenchmarks see almost 200% improvement, while full-system tests see
around 50% improvement, mostly due to improved IPC.



Dan Eloff

Mar 12, 2017, 3:11:38 PM
to mechanica...@googlegroups.com
Two things about this surprised me greatly:

1) That batching function calls using futures like this, with the effect of a hot icache and a slightly cooler dcache in just a few parts of ScyllaDB, led to a 50% speedup. I would never have thought it would make such a difference.
2) Your codebase was already using futures, so the changes to ScyllaDB itself were minimal. Very nicely done.

I will keep this technique, and Seastar itself, in mind for future applications. Thanks for sharing!






agallego

Mar 13, 2017, 9:21:06 AM
to mechanical-sympathy


On Thursday, March 9, 2017 at 11:42:03 AM UTC-5, Avi Kivity wrote:
We noticed in ScyllaDB that performance suffers due to a high icache miss
rate; we saw very low IPC.



I wonder if we can use an implicit execution stage within the future scheduler via tags, à la:

template<typename T, typename Tag=void>
future<T>...


maybe a new type? -> tagged_future<Tag, VariadicTypes...>


Maybe it doesn't work due to the variadic template of the future<>? 
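
Something like this rough sketch, maybe (all names hypothetical, just to
illustrate the routing):

#include <utility>
#include <vector>

// Hypothetical sketch: a function-local static gives every call site
// using the same Tag one shared queue, so stage grouping is implicit.
template <typename Tag, typename Arg>
void tagged_call(void (*func)(Arg), Arg arg) {
    static std::vector<Arg> queue;   // one queue per (Tag, Arg) pair
    queue.push_back(std::move(arg));
    if (queue.size() >= 32) {        // arbitrary flush threshold
        for (auto& a : queue) {
            func(std::move(a));
        }
        queue.clear();
    }
}

struct compress_tag {};              // hypothetical tag type
// usage: tagged_call<compress_tag>(compress_block, std::move(block));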

Did you guys play with the size of the batches of the function call group, i.e., how does perf change from, say, 10 consecutive calls vs. 100 vs. 1000?

As in, did it have a major negative effect on the IO loop? Just wondering.

I have to add this to my seastar work too.

I guess you found this most useful in the scylla work on compressing + writing to disk ... or did you have any other insight as to where in scylla it would be most helpful?

Trying to understand how to apply this to some of my seastar stuff. 


Thanks for sharing. 

Avi Kivity

Mar 13, 2017, 11:06:36 AM
to mechanica...@googlegroups.com



On 03/12/2017 09:11 PM, Dan Eloff wrote:
Two things about this surprised me greatly:

1) That batching function calls using futures like this, with the effect of a hot icache and a slightly cooler dcache in just a few parts of ScyllaDB, led to a 50% speedup. I would never have thought it would make such a difference.

I was just as surprised.


2) Your codebase was already using futures, so the changes to ScyllaDB itself were minimal. Very nicely done.

I will keep this technique, and Seastar itself, in mind for future applications. Thanks for sharing!


Avi Kivity

Mar 13, 2017, 11:10:32 AM
to mechanica...@googlegroups.com



On 03/13/2017 03:21 PM, agallego wrote:



I wonder if we can use an implicit execution stage within the future scheduler via tags, à la:

template<typename T, typename Tag=void>
future<T>...


maybe a new type? -> tagged_future<Tag, VariadicTypes...>


Maybe it doesn't work due to the variadic template of the future<>? 



It's possible to hack something up, but I'm not sure it's worth the compile-time increase. The current interface is simple enough.


Did you guys play with the size of the batches of the function call group, i.e., how does perf change from, say, 10 consecutive calls vs. 100 vs. 1000?

I did not. It's somewhat self-tuning, in that a too-large batch will get preempted.



As in, did it have a major negative effect on the IO loop? - just wondering.

What do you mean?



Thanks for sharing. I have to add this to my seastar work too.


Check your IPC before.  If it's not too low, there's probably no point.

I'd guess IPC >= 2: don't touch; < 0.5: definitely; anything in between is a maybe. But that's just a guess.
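
(IPC here is what perf stat prints as "insn per cycle", e.g.
perf stat -e cycles,instructions. If you want the number in-process on
Linux, here is a bare-bones sketch, error handling elided; workload() is
a stand-in for the code being measured:)

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open one hardware counter for the calling thread, optionally grouped.
static int open_counter(uint64_t config, int group_fd) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = (group_fd == -1);  // only the group leader starts disabled
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
}

void workload();  // stand-in for the code whose IPC we want

int main() {
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cycles);
    ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    workload();
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
    long long c = 0, i = 0;
    read(cycles, &c, sizeof c);
    read(insns, &i, sizeof i);
    std::printf("IPC: %.2f\n", (double)i / (double)c);
}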



I guess you found this most useful in the scylla work on compressing + writing to disk ... or did you have any other insight as to where in scylla it would be most helpful?


Mostly where we have complex code running on small work items.

Trying to understand how to apply this to some of my seastar stuff. 


Thanks for sharing. 

 
Nitsan Wakart

Mar 15, 2017, 9:46:24 AM
to mechanica...@googlegroups.com
This, to me, is the distilled observation here, so I'm repeating it in isolation: "Mostly where we have complex code running on small work items."
Which makes a lot of sense.

Henry Robinson

Mar 16, 2017, 1:33:01 PM
to mechanical-sympathy
How small is 'small' in this context? It's interesting to contrast this with the direction of modern analytical databases, where there is a natural pipeline of operators through which a batch of data (maybe 10KB, but with high variance) runs. Rather than execute this pipeline one operator-at-a-time, the most recent work (e.g. [1][2]) in the area recommends doing everything possible to keep the current batch in cache, and sending it as far through its pipeline as it can go without an (application-level) context switch.

The benefit is partly that the cost of moving data between stages is greatly reduced (indeed, the idea is to compile query plans into one giant function with no parameter passing or return overhead), at the cost of losing icache locality across batches (but you get plenty of it intra-batch).  
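
A toy contrast of the two styles (illustrative only):

#include <vector>

// Operator-at-a-time: each operator streams over the whole batch. Each
// operator's code stays hot while it runs, but the batch must survive
// in dcache between operators.
std::vector<int> filter_pos(const std::vector<int>& in) {
    std::vector<int> out;
    for (int v : in) if (v > 0) out.push_back(v);
    return out;
}
std::vector<int> times2(const std::vector<int>& in) {
    std::vector<int> out;
    for (int v : in) out.push_back(v * 2);
    return out;
}

// Data-centric/fused: one loop pushes each value through the whole
// pipeline while it is still in a register; no intermediate batch, at
// the cost of a larger fused code footprint.
std::vector<int> fused(const std::vector<int>& in) {
    std::vector<int> out;
    for (int v : in) {
        if (v > 0) out.push_back(v * 2);
    }
    return out;
}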

On 15 March 2017 at 06:42, 'Nitsan Wakart' via mechanical-sympathy <mechanical-sympathy@googlegroups.com> wrote:
This, to me, is the distilled observation here, so I'm repeating it in isolation: "Mostly where we have complex code running on small work items."
Which makes a lot of sense.