Hi All,
I am testing a few network performance optimization techniques for Seastar using the standard POSIX networking stack. Right now I am using SO_ATTACH_REUSEPORT_CBPF to enforce perfect locality[1]. For this to work, there have to be certain guarantees about the order in which the sockets are opened and about which CPUs the corresponding threads are pinned to.
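For context, the attachment looks roughly like this on each listening socket (error handling omitted; attach_cbpf and listen_fd are placeholder names). The two-instruction program returns the current CPU id, and the kernel uses that value, modulo the group size, to index the reuseport group in the order the sockets joined it, which is why the open order matters:

#include <linux/filter.h>   // sock_filter, sock_fprog, BPF_*, SKF_AD_*
#include <sys/socket.h>     // setsockopt, SO_ATTACH_REUSEPORT_CBPF

void attach_cbpf(int listen_fd) {
    struct sock_filter code[] = {
        // A = number of the CPU the packet was received on
        { BPF_LD | BPF_W | BPF_ABS, 0, 0, (__u32) (SKF_AD_OFF + SKF_AD_CPU) },
        // return A (selects socket A in the reuseport group)
        { BPF_RET | BPF_A, 0, 0, 0 },
    };
    struct sock_fprog prog = { 2, code };
    setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));
}

So if socket N in the group was opened by the thread pinned to CPU N, every connection is handled entirely on the CPU that received it.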
I tried to control the socket-opening order like this:
// open a listening socket on each shard, in shard-id order
auto range = boost::irange<unsigned>(0u, smp::count);  // smp::count is unsigned
return do_for_each(range, [server = std::move(server), port] (unsigned i) {
    return server->invoke_on(i, &tcp_server::listen, ipv4_addr{port});
});
As far as I can tell this should invoke the listen() function on each shard in the desired order; however, I wasn't seeing the performance boost I expected. It turns out that the shard_id wasn't matching the CPU id to which the shard is pinned.
I confirmed this with a bpftrace script that attaches kprobes to reuseport_alloc() and reuseport_add_sock():
tcp_httpd_demo, cpu=0, socket 0
reactor-1, socket 1, cpu=2
reactor-2, socket 2, cpu=1
reactor-3, socket 3, cpu=3
I dug further into the Seastar code and saw that it uses hwloc to discover the hardware topology and optimize accordingly. I am running my test on a 4-vCPU AWS instance, and this is the relevant part of the lstopo output:
NUMANode L#0 (P#0 10100MB)
  L3 L#0 (25MB)
    L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#2)
    L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#3)
Note that cpu0 and cpu2 are hyperthread siblings on the same physical core (Core L#0), and hwloc numbers PUs depth-first through the topology, so the second logical PU (L#1) is OS cpu2. My assumption is that Seastar pins shard i to the i-th PU in hwloc's logical order, which explains shard 1 landing on cpu2 and shard 2 on cpu1. That ordering may well be the right default for Seastar, but it messes things up for me because shard_id no longer matches the OS cpu id.
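For reference, hwloc's logical-to-OS mapping can be dumped with a few lines against the plain hwloc API (this just mirrors the lstopo output; whether it matches Seastar's shard allocation exactly is precisely my assumption above):

#include <hwloc.h>
#include <cstdio>

int main() {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
        // given the lstopo output above, this should print
        // L#0 -> cpu0, L#1 -> cpu2, L#2 -> cpu1, L#3 -> cpu3
        std::printf("PU L#%d -> cpu%u\n", i, pu->os_index);
    }
    hwloc_topology_destroy(topo);
}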
First I disabled hwloc by passing --disable-hwloc to configure.py. Performance improved and bpftrace showed the expected output:
tcp_httpd_demo, cpu=0, socket 0
reactor-1, socket 1, cpu=1
reactor-2, socket 2, cpu=2
reactor-3, socket 3, cpu=3
But of course this isn't a good strategy since we would lose all the other hwloc benefits.
Next I tried a quick hack: I modified smp::configure() in reactor.cc to call smp::pin(i) instead of smp::pin(allocation.cpu_id). This also worked and let me continue testing, but it is not a proper solution either.
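The change was roughly this (paraphrased from memory, not an exact diff):

-    smp::pin(allocation.cpu_id);   // CPU chosen by the hwloc-based allocation
+    smp::pin(i);                   // force shard i onto OS cpu i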
Next I searched for ways to tell hwloc to stick to strict logical CPU ordering, but so far I haven't found anything.
Another approach would be to modify my code to call invoke_on() such that the socket ordering matches the CPU ids; in my case that means calling invoke_on() with shard_ids 0, 2, 1, and finally 3. This would require me to determine ahead of time which shard is pinned to which CPU. Is there an easy way to do that?
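In case it helps to show what I mean, here is a sketch of a runtime-probing variant (untested; it assumes server is a copyable smart pointer as in my snippet above, and uses the glibc-specific sched_getcpu() from <sched.h>, plus <algorithm> and <numeric>):

// Probe each shard for the CPU its reactor thread is pinned to, then
// open the listening sockets in ascending-CPU order so that socket N
// in the reuseport group belongs to cpu N.
return do_with(std::vector<int>(smp::count), [server, port] (std::vector<int>& cpu_of_shard) {
    auto range = boost::irange<unsigned>(0u, smp::count);
    return do_for_each(range, [&cpu_of_shard] (unsigned shard) {
        return smp::submit_to(shard, [] { return sched_getcpu(); }).then(
            [&cpu_of_shard, shard] (int cpu) { cpu_of_shard[shard] = cpu; });
    }).then([&cpu_of_shard, server, port] {
        // visit shards ordered by their pinned CPU: 0, 2, 1, 3 in my case
        std::vector<unsigned> order(smp::count);
        std::iota(order.begin(), order.end(), 0u);
        std::sort(order.begin(), order.end(), [&cpu_of_shard] (unsigned a, unsigned b) {
            return cpu_of_shard[a] < cpu_of_shard[b];
        });
        return do_with(std::move(order), [server, port] (std::vector<unsigned>& order) {
            return do_for_each(order, [server, port] (unsigned shard) {
                return server->invoke_on(shard, &tcp_server::listen, ipv4_addr{port});
            });
        });
    });
});

A proper API exposing the shard-to-CPU mapping would obviously be nicer than probing it at runtime like this.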