Interesting Code Generation Bug

Bill Wendling

May 6, 2020, 3:50:53 PM
to clang-built-linux
This is a bug we came across here at an obscure search company; I thought people would be interested in it and/or might be able to help with it:

The CPU has a known problem with reloading a byte/short/int/long right after it has been written. Here in GRO, the issue is that NAPI_GRO_CB(skb)->same_flow is written (at the end of skb_gro_receive()) only a few cycles before clang-generated code reads back the byte containing NAPI_GRO_CB(skb)->free.

The dev_gro_receive() stall:

             mov    $0x5,%r12d
             cmp    $0xffffffffffffff8d,%rbx
             je     3bd
 37.63  522: movzwl 0x4a(%r15),%r14d      // high cost reading something already in cache
             xor    %r12d,%r12d
  1.08       test   $0xc0,%r14b
  1.79       setne  %al
             test   %rbx,%rbx

 18.69%  [kernel]  [k] gq_rx_alloc_page
 14.19%  [kernel]  [k] dev_gro_receive    // only with CLANG=1 do we see such a high cost
 13.02%  [kernel]  [k] gq_rx_napi_handler
  7.95%  [kernel]  [k] tcp_gro_receive
  7.32%  [kernel]  [k] __direct_call_packet_offload_callbacks_gro_receive1
  5.34%  [kernel]  [k] skb_gro_receive

        │ 000000000022b880 <clear_b1>:
        │ clear_b1():
  3.97  │   callq  __fentry__
        │   push   %rbp
  3.77  │   mov    %rsp,%rbp
 90.62  │   andb   $0xfe,(%rdi)           // byte access
  1.63  │   pop    %rbp
        │   retq

While the iter() code uses word accesses:

  0.89  │   and    $0x1,%r12d
        │   mov    %r14,%rdi
  1.16  │   callq  clear_b1
 33.36  │   mov    0x2bfd90(%r13),%ecx    // very high penalty
  0.10  │   mov    %ecx,%edx
  1.89  │   shr    $0x8,%edx
  3.59  │   add    %ebx,%edx
  1.37  │   and    $0x1,%edx
        │   mov    %edx,%eax
  1.72  │   shl    $0x8,%eax
  0.83  │   and    $0xfffffeff,%ecx
  1.21  │   or     %eax,%ecx
  4.24  │   mov    %ecx,0x2bfd90(%r13)
  0.02  │   addl   $0x1,0x2bfd94(%r13)
        │   mov    %r12d,%eax
  2.28  │   shl    $0x18,%eax
        │   test   %edx,%edx
  0.02  │   je     20
  8.72  │   and    $0xfeffffff,%ecx
  0.69  │   or     %eax,%ecx
  2.66  │   mov    %ecx,(%r14)
        │   jmpq   20

Another very high cost with clang is the skb->l4_hash setting, again done with a read of a 32-bit quantity in order to set a 2-bit bitfield:

    skb_set_hash(skb, be32_to_cpu(desc->rss_hash), gq_rss_type(desc->flags_seq));

    mov    0x80(%r13),%ecx     // 40% of cpu cycles in gq_rx_napi_handler() !
    mov    $0xfffffcff,%esi
    and    %esi,%ecx
    or     %edx,%ecx
    mov    %cx,0x80(%r13)

Again, this high cost (a 32-bit read) is caused by a prior 16-bit write to 0x80(%r13):

    mov    0x80(%r13),%eax
    and    $0xffffff9f,%eax
    or     $0x40,%eax
    mov    %ax,0x80(%r13)      // writing 16 bits !

That prior sequence comes from setting skb->ip_summed:

    skb->ip_summed = CHECKSUM_COMPLETE;

Here is what gcc emits (no mix of byte/word accesses):

    765: 0f b6 83 80 00 00 00    movzbl 0x80(%rbx),%eax    // byte load
    76c: 83 e0 9f                and    $0xffffff9f,%eax
    76f: 83 c8 40                or     $0x40,%eax
    772: 88 83 80 00 00 00       mov    %al,0x80(%rbx)     // byte write
    ...
    79e: 0f b6 83 81 00 00 00    movzbl 0x81(%rbx),%eax    // byte load
    7a5: 41 8b 56 08             mov    0x8(%r14),%edx
    7a9: 40 0f 95 c6             setne  %sil
    7ad: 83 e0 fc                and    $0xfffffffc,%eax
    7b0: 0f ca                   bswap  %edx
    7b2: 09 f0                   or     %esi,%eax
    7b4: 89 93 94 00 00 00       mov    %edx,0x94(%rbx)
    7ba: 88 83 81 00 00 00       mov    %al,0x81(%rbx)     // byte write
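At the source level, the last two dumps correspond to roughly the following sketch (an illustrative stand-in only; the real fields are __u8 bitfields inside struct sk_buff and sit in a longer run of flag bits):

    /* Illustrative stand-in for the relevant sk_buff members: byte-typed
     * flag bitfields next to the 32-bit hash.  In the dumps above, clang
     * writes the bytes holding ip_summed with a 16-bit store and, a few
     * cycles later, reloads the same bytes with a 32-bit load to set
     * l4_hash (the narrow-store/wide-load mix the CPU penalizes), while
     * gcc touches the flag bytes only with byte loads and stores. */
    struct skb_like {
            unsigned char ip_summed:2;
            unsigned char other_flags:6;   /* stand-in for neighbouring flag bits */
            unsigned char l4_hash:1;
            unsigned char sw_hash:1;
            unsigned int  hash;
    };

    void receive_one(struct skb_like *skb, unsigned int rxhash)
    {
            skb->ip_summed = 2;    /* CHECKSUM_COMPLETE                  */
            skb->hash = rxhash;    /* skb_set_hash(): store the hash ... */
            skb->l4_hash = 1;      /* ... and set the two flag bits      */
            skb->sw_hash = 0;
    }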
A replication program is attached. Some results:

$ perf stat -r10 -e cycles clang-bitfield:
     8,709,304,936      cycles:u                            ( +- 0.34% )
            2.8238 +- 0.0131 seconds time elapsed           ( +- 0.46% )

$ perf stat -r10 -e cycles clang-bitfield-word:
     8,059,274,548      cycles:u                            ( +- 0.13% )
            2.6271 +- 0.0118 seconds time elapsed           ( +- 0.45% )

$ perf stat -r10 -e cycles gcc-bitfield:
     7,841,119,839      cycles:u                            ( +- 0.25% )
            2.5847 +- 0.0151 seconds time elapsed           ( +- 0.58% )
bitfield.c
clear.c
makefile
bitfield.h

craig....@gmail.com

May 9, 2020, 2:39:51 AM
to Clang Built Linux
A few things I noticed while poking around:

The clang frontend emits the load/store for l4_hash and sw_hash as an i16 even though they are declared as __u8. I'm not sure how clang decides the access types for bitfields.
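If it helps, here is a reduced illustration of the grouping behaviour (an illustrative struct, not sk_buff itself; my understanding is that clang lumps a run of consecutive bitfields into one storage unit sized to cover the whole run, so a run of byte-declared bits that spills past one byte ends up with i16 accesses, while gcc sticks to the declared type's width):

    /* Nine one-bit fields with a one-byte base type: together they no
     * longer fit in a single byte, so clang (as I understand its record
     * layout) treats the run as one 16-bit storage unit and emits i16
     * loads/stores even when touching a single bit; gcc keeps the
     * accesses byte-sized. */
    struct run {
            unsigned char a:1, b:1, c:1, d:1;
            unsigned char e:1, f:1, g:1, h:1;
            unsigned char i:1;              /* spills into the second byte */
    };

    void set_i(struct run *r)
    {
            r->i = 1;
    }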

The X86 backend really likes to turn i16 and i8 loads that are at least 4-byte aligned into i32 loads, unless the load is volatile. This helps with folding loads into operations, especially when operations are promoted from 16 bits, but we seem to do it even for loads we aren't folding. It's a 1-byte encoding savings for i8, but it doesn't save any bytes for i16. We also prefer to widen a 4-byte-aligned load so we can fold it into an instruction even when there's an immediate we could fold instead, unless the immediate is in [-128, 127].
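A small example of the kind of load in question (an illustrative struct and functions, not kernel code; whether any particular load actually gets widened depends on the surrounding code, per the caveats above):

    #include <stdint.h>

    struct s {
            uint8_t  flags;       /* byte field at a 4-byte-aligned offset */
            uint8_t  pad[3];
            uint32_t hash;
    };

    /* A plain i8 load like this is a candidate for being widened into an
     * i32 load of the containing word (the upper bytes just get ignored),
     * whereas the volatile version has to stay a true byte load. */
    uint8_t load_flags(const struct s *p)            { return p->flags; }
    uint8_t load_flags_v(const volatile struct s *p) { return p->flags; }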

I think when I looked at some of this back in March, I also spotted some interaction with isTypeDesirableForOp or IsDesirableToPromoteOp in the DAG combiner. Those are two of the functions we use to indicate that i16 should be avoided due to the extra 0x66 prefix in the encoding. I can't remember the specifics of how that came into play here.