Controlling/Knowing the relationship between shard_id and cpu id


Marc Richards

Feb 24, 2022, 9:04:54 PM2/24/22
to seastar-dev
Hi All,

I am testing a few network performance optimization techniques for Seastar using the standard POSIX networking stack. Right now I am using SO_ATTACH_REUSEPORT_CBPF to enforce perfect locality[1]. For this to work, there have to be certain guarantees about the order in which sockets are opened and the CPUs to which the corresponding threads are pinned.

I tried to control it like this:

    auto range = boost::irange<unsigned>(0, smp::count);
    return do_for_each(range, [server = std::move(server), port] (unsigned i) {
        return server->invoke_on(i, &tcp_server::listen, ipv4_addr{port});
    });

As far as I can tell this should invoke the listen() function on each shard in the desired order; however, I wasn't seeing the performance boost that I expected. As it turns out, the shard_id wasn't matching the CPU id to which the shard is pinned.

I confirmed using a bpftrace script that attached kprobes to reuseport_alloc() and reuseport_add_sock().

tcp_httpd_demo, socket 0, cpu=0
reactor-1, socket 1, cpu=2
reactor-2, socket 2, cpu=1
reactor-3, socket 3, cpu=3

I dug into the Seastar code further and saw that it is using hwloc to understand the hardware topology and optimize accordingly. I am running my test on a 4vCPU instance on AWS, and this is the partial output of lstopo:

    NUMANode L#0 (P#0 10100MB)
    L3 L#0 (25MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#2)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#3)

My assumption is that hwloc treats cpu0 as the first CPU and cpu2 as the second because they sit on separate physical cores. This maximizes performance in scenarios where not all CPUs are used, but it breaks the socket-to-CPU ordering I need.

First I disabled hwloc by passing --disable-hwloc. Performance improved, and bpftrace showed the expected output:

tcp_httpd_demo, socket 0, cpu=0
reactor-1, socket 1, cpu=1
reactor-2, socket 2, cpu=2
reactor-3, socket 3, cpu=3

But of course this isn't a good strategy, since we would lose all the other hwloc benefits.
Next I did a quick hack and modified smp::configure() to call smp::pin(i) instead of smp::pin(allocation.cpu_id). This also worked and allowed me to continue my testing, but it is also not a good solution.

Next I searched for a way to tell hwloc to stick to strict logical CPU ordering, but so far I haven't found anything.

Another approach would be to modify my code to call invoke_on() such that the socket ordering matches the CPU ids. So I would have to call invoke_on() using shard_ids 0, 2, 1, and finally 3. This would require me to be able to determine ahead of time which shard is pinned to which CPU. Is there an easy way to do that?
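(The best workaround I can think of so far is to ask each reactor thread for its own affinity mask; a plain-POSIX sketch with a hypothetical pinned_cpu() helper — in Seastar it would be run on every shard, e.g. via smp::submit_to(), and gathered into a vector:)

```cpp
#include <sched.h>
#include <optional>

// Determine which single CPU the calling thread is pinned to.
// Returns nullopt if the thread is not pinned to exactly one CPU
// (e.g. under --overprovisioned, where there is no pinning at all).
std::optional<unsigned> pinned_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return std::nullopt;
    }
    if (CPU_COUNT(&set) != 1) {
        return std::nullopt;  // not pinned, or pinned to several CPUs
    }
    for (unsigned cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return std::nullopt;
}
```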

Avi Kivity

Feb 25, 2022, 11:17:26 AM2/25/22
to Marc Richards, seastar-dev

There's no guarantee that a shard will be pinned to a cpu, see --overprovisioned.

It makes sense to be able to derive the mapping of shards to CPUs. Another use case is to allocate large read-only structures per L3 cache, rather than per-shard, to reduce L3 pollution.

So we can add "reactor::hardware_cpu_mapping() const -> std::optional<unsigned>" that returns nullopt if there is no pinning, or the cpu id if there is pinning. Maybe it's more useful to return a map, so perhaps

    std::optional<std::vector<unsigned>> smp::hardware_cpu_mapping()

returning nullopt if we are not using pinning.
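With that mapping in hand, the caller-side inversion is trivial: compute, for each CPU, the shard to invoke so that sockets are opened in CPU order. A plain C++ sketch (shard_order_for_cpu() is a hypothetical helper, not proposed API):

```cpp
#include <optional>
#include <vector>

// Given the proposed shard->cpu mapping, return the shard to invoke
// for each CPU id, so the i-th socket opened lands on CPU i.
// Propagates nullopt when there is no pinning.
std::optional<std::vector<unsigned>> shard_order_for_cpu(
        const std::optional<std::vector<unsigned>>& shard_to_cpu) {
    if (!shard_to_cpu) {
        return std::nullopt;  // no pinning: ordering is meaningless
    }
    std::vector<unsigned> order(shard_to_cpu->size());
    for (unsigned shard = 0; shard < shard_to_cpu->size(); ++shard) {
        order[(*shard_to_cpu)[shard]] = shard;
    }
    return order;
}
```

On the 4-vCPU instance above, the mapping {0, 2, 1, 3} would yield the invoke_on() order 0, 2, 1, 3 that Marc derived by hand.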

Marc Richards

Feb 25, 2022, 11:32:56 AM2/25/22
to seastar-dev
Yes, that would be perfect for my use case. And I also agree that returning a map would be more useful.