Out of curiosity are you using C++ or Java, and does the behaviour change with 1.0 of Aeron?
In addition to Philip's comments on a performance model there are some things worth exploring:
- Turbo Boost: As more cores become active, the clock rate can drop. Using the x86 PAUSE instruction in spin loops can help, but it is best to frequency-lock all cores.
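As a small illustration (not Aeron code), Java 9+ exposes the PAUSE hint via Thread.onSpinWait(), which on x86 typically compiles down to the PAUSE instruction and eases the pressure a busy-spin puts on the core:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitExample {
    public static void main(final String[] args) throws InterruptedException {
        final AtomicBoolean ready = new AtomicBoolean(false);

        final Thread spinner = new Thread(() -> {
            // Busy-spin until the flag is set. Thread.onSpinWait() is a hint
            // that typically maps to the x86 PAUSE instruction, reducing the
            // spin loop's impact on execution resources and power draw.
            while (!ready.get()) {
                Thread.onSpinWait();
            }
            System.out.println("flag observed");
        });

        spinner.start();
        Thread.sleep(10);
        ready.set(true);
        spinner.join();
    }
}
```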
- Bandwidth Limitations: If all cores are accessing the same L3 cache slice then the port on that slice can become a bottleneck. Cache coherence traffic also needs to be considered: the publisher must gain exclusive access to a cache line before modifying it, which invalidates the copies held by the other cores, and those cores must then re-fetch the line.
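The cost of that invalidate-and-refetch cycle is easy to demonstrate with a contrived sketch (not Aeron code): several threads incrementing counters that either share a cache line or are padded onto their own lines. The shared-line case generates constant coherence traffic as each writer steals exclusive ownership from the others; the padded case lets each writer keep its line in exclusive state. The timings will vary by machine, so only the totals are checked here.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class CoherenceTrafficExample {
    static final int THREADS = 4;
    static final long ITERATIONS = 1_000_000;

    // With stride 1 all counters share a cache line, so every increment
    // invalidates the line in the other cores (coherence traffic).
    // With stride 8 (64 bytes of longs) each counter owns its own line.
    static void run(final int stride) throws InterruptedException {
        final AtomicLongArray counters = new AtomicLongArray(THREADS * stride);
        final Thread[] threads = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int slot = t * stride;
            threads[t] = new Thread(() -> {
                for (long i = 0; i < ITERATIONS; i++) {
                    counters.getAndIncrement(slot);
                }
            });
        }
        final long start = System.nanoTime();
        for (final Thread t : threads) t.start();
        for (final Thread t : threads) t.join();
        final long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        long total = 0;
        for (int t = 0; t < THREADS; t++) total += counters.get(t * stride);
        System.out.println("stride " + stride + ": total=" + total + " in " + elapsedMs + "ms");
    }

    public static void main(final String[] args) throws InterruptedException {
        run(1); // shared line: heavy invalidation traffic
        run(8); // padded: each writer keeps its line exclusive
    }
}
```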
Are you seeing back pressure on the publisher? If so, then you are likely waiting for the publisher flow control window to be updated. This could be either the driver conductor or one or more of the subscribers being starved out and thus holding everyone else back.
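You can detect this at the call site: Aeron's Publication.offer() returns the new stream position on success and a negative sentinel otherwise. The sketch below mirrors the documented sentinel values with local constants so it runs standalone; in real code you would check the result of publication.offer() against the constants on io.aeron.Publication itself.

```java
public class OfferResultExample {
    // Local mirrors of the sentinel values documented on io.aeron.Publication.
    static final long NOT_CONNECTED = -1;
    static final long BACK_PRESSURED = -2;
    static final long ADMIN_ACTION = -3;
    static final long CLOSED = -4;

    static String describe(final long result) {
        if (result >= 0) return "accepted at position " + result;
        if (result == BACK_PRESSURED) return "back pressured: flow control window exhausted, retry";
        if (result == NOT_CONNECTED) return "no connected subscriber";
        if (result == ADMIN_ACTION) return "admin action in progress, retry";
        if (result == CLOSED) return "publication closed";
        return "unknown error";
    }

    public static void main(final String[] args) {
        System.out.println(describe(BACK_PRESSURED));
        System.out.println(describe(128L));
    }
}
```

Counting how often you see BACK_PRESSURED versus a positive position is a cheap first measurement for the model discussed below.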
There is so much to look at: cache misses, starvation, setup for NUMA and CoD (Cluster on Die, effectively NUMA on a socket). Best to have a model of what you expect, then measure what is being observed and see if the experimental evidence fits the model. You need to model the flow rates and dependencies. To have parallel in-flight cache misses you need to avoid data-dependent loads, and even then you only have ~10 line fill buffers per core to keep the cache misses operating concurrently. If you graph the scale-up then limitations like bandwidth, buffers, etc. become obvious as the queuing effects kick in.
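The data-dependent-load point can be seen with a toy sketch (again, not Aeron code): chasing a random permutation forces each load to wait for the previous one, so only one miss is in flight, whereas a linear sweep lets the line fill buffers keep many misses in flight. Absolute timings depend on the machine, so only the deterministic results are checked.

```java
import java.util.concurrent.ThreadLocalRandom;

public class MemoryParallelismExample {
    public static void main(final String[] args) {
        final int size = 1 << 20;
        final int[] next = new int[size];

        // Build a random permutation so each load's address depends on the
        // previous load's result (a data-dependent chain of misses).
        for (int i = 0; i < size; i++) next[i] = i;
        for (int i = size - 1; i > 0; i--) {
            final int j = ThreadLocalRandom.current().nextInt(i + 1);
            final int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        // Dependent loads: each iteration waits on the previous miss,
        // so at most one miss is outstanding at a time.
        long start = System.nanoTime();
        int idx = 0;
        for (int i = 0; i < size; i++) idx = next[idx];
        final long dependentNanos = System.nanoTime() - start;

        // Independent loads: addresses are known up front, so the core's
        // line fill buffers can keep several misses in flight concurrently.
        start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < size; i++) sum += next[i];
        final long independentNanos = System.nanoTime() - start;

        System.out.println("dependent ns:   " + dependentNanos + " (idx=" + idx + ")");
        System.out.println("independent ns: " + independentNanos + " (sum=" + sum + ")");
    }
}
```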
Martin...