This is a bug we came across here at an obscure search company that I thought people would be interested in and/or may be able to help with:
The CPU has a known problem (a store-to-load forwarding stall) when reloading a byte/short/int/long right after it has been written, especially when the load is wider than the preceding store.
Here in GRO, the issue is that NAPI_GRO_CB(skb)->same_flow is written (at the end of skb_gro_receive()) a few cycles before CLANG reads again the byte containing NAPI_GRO_CB(skb)->free.
The dev_gro_receive() stall:
mov $0x5,%r12d
cmp $0xffffffffffffff8d,%rbx
je 3bd
37.63 522: movzwl 0x4a(%r15),%r14d // High cost reading something already in cache
xor %r12d,%r12d
1.08 test $0xc0,%r14b
1.79 setne %al
test %rbx,%rbx
18.69% [kernel] [k] gq_rx_alloc_page
14.19% [kernel] [k] dev_gro_receive // Only with CLANG=1 do we see such a high cost
13.02% [kernel] [k] gq_rx_napi_handler
7.95% [kernel] [k] tcp_gro_receive
7.32% [kernel] [k] __direct_call_packet_offload_callbacks_gro_receive1
5.34% [kernel] [k] skb_gro_receive
│ 000000000022b880 <clear_b1>:
│ clear_b1():
3.97 │ callq __fentry__
│ push %rbp
3.77 │ mov %rsp,%rbp
90.62 │ andb $0xfe,(%rdi) // byte access
1.63 │ pop %rbp
│ retq
While the iter() stuff uses word access:
0.89 │ and $0x1,%r12d
│ mov %r14,%rdi
1.16 │ callq clear_b1
33.36 │ mov 0x2bfd90(%r13),%ecx // very high penalty
0.10 │ mov %ecx,%edx
1.89 │ shr $0x8,%edx
3.59 │ add %ebx,%edx
1.37 │ and $0x1,%edx
│ mov %edx,%eax
1.72 │ shl $0x8,%eax
0.83 │ and $0xfffffeff,%ecx
1.21 │ or %eax,%ecx
4.24 │ mov %ecx,0x2bfd90(%r13)
0.02 │ addl $0x1,0x2bfd94(%r13)
│ mov %r12d,%eax
2.28 │ shl $0x18,%eax
│ test %edx,%edx
0.02 │ je 20
8.72 │ and $0xfeffffff,%ecx
0.69 │ or %eax,%ecx
2.66 │ mov %ecx,(%r14)
│ jmpq 20
Another very high cost with CLANG is the skb->l4_hash setting, which again reads a 32-bit quantity in order to set a 2-bit bitfield.
skb_set_hash(skb, be32_to_cpu(desc->rss_hash), gq_rss_type(desc->flags_seq));
mov 0x80(%r13),%ecx // 40% of cpu cycles in gq_rx_napi_handler() !
mov $0xfffffcff,%esi
and %esi,%ecx
or %edx,%ecx
mov %cx,0x80(%r13)
Again, this high cost (reading 32 bits) is because of a prior write of 16 bits at 0x80(%r13):
mov 0x80(%r13),%eax
and $0xffffff9f,%eax
or $0x40,%eax
mov %ax,0x80(%r13) // writing 16bits !
This prior sequence is about skb->ip_summed being set: skb->ip_summed = CHECKSUM_COMPLETE;
Here is what gcc emits (no mixing of byte/word accesses):
765: 0f b6 83 80 00 00 00 movzbl 0x80(%rbx),%eax // byte load
76c: 83 e0 9f and $0xffffff9f,%eax
76f: 83 c8 40 or $0x40,%eax
772: 88 83 80 00 00 00 mov %al,0x80(%rbx) // byte write
...
79e: 0f b6 83 81 00 00 00 movzbl 0x81(%rbx),%eax // byte load
7a5: 41 8b 56 08 mov 0x8(%r14),%edx
7a9: 40 0f 95 c6 setne %sil
7ad: 83 e0 fc and $0xfffffffc,%eax
7b0: 0f ca bswap %edx
7b2: 09 f0 or %esi,%eax
7b4: 89 93 94 00 00 00 mov %edx,0x94(%rbx)
7ba: 88 83 81 00 00 00 mov %al,0x81(%rbx) // byte write
A replication program is attached. Some results:
$ perf stat -r10 -e cycles clang-bitfield:
8,709,304,936 cycles:u ( +- 0.34% )
2.8238 +- 0.0131 seconds time elapsed ( +- 0.46% )
$ perf stat -r10 -e cycles clang-bitfield-word:
8,059,274,548 cycles:u ( +- 0.13% )
2.6271 +- 0.0118 seconds time elapsed ( +- 0.45% )
$ perf stat -r10 -e cycles gcc-bitfield:
7,841,119,839 cycles:u ( +- 0.25% )
2.5847 +- 0.0151 seconds time elapsed ( +- 0.58% )