This is a bug we came across here at an obscure search company that I thought people would be interested in and/or may be able to help with:
The CPU has a known problem (a store-to-load forwarding stall) when reloading a byte/short/int/long right after it has been written, especially when the load is wider than the preceding store.
Here in GRO, the issue is that NAPI_GRO_CB(skb)->same_flow is written (at the end of skb_gro_receive()) a few cycles before CLANG reads again the byte containing NAPI_GRO_CB(skb)->free.
The dev_gro_receive() stall:
mov $0x5,%r12d
cmp $0xffffffffffffff8d,%rbx
je 3bd
37.63 522: movzwl 0x4a(%r15),%r14d // High cost reading something already in cache
xor %r12d,%r12d
1.08 test $0xc0,%r14b
1.79 setne %al
test %rbx,%rbx
18.69% [kernel] [k] gq_rx_alloc_page
14.19% [kernel] [k] dev_gro_receive // Only with CLANG=1 do we see such a high cost
13.02% [kernel] [k] gq_rx_napi_handler
7.95% [kernel] [k] tcp_gro_receive
7.32% [kernel] [k] __direct_call_packet_offload_callbacks_gro_receive1
5.34% [kernel] [k] skb_gro_receive
│ 000000000022b880 <clear_b1>:
│ clear_b1():
3.97 │ callq __fentry__
│ push %rbp
3.77 │ mov %rsp,%rbp
90.62 │ andb $0xfe,(%rdi) // byte access
1.63 │ pop %rbp
│ retq
While the iter() stuff uses word access:
0.89 │ and $0x1,%r12d
│ mov %r14,%rdi
1.16 │ callq clear_b1
33.36 │ mov 0x2bfd90(%r13),%ecx // very high penalty
0.10 │ mov %ecx,%edx
1.89 │ shr $0x8,%edx
3.59 │ add %ebx,%edx
1.37 │ and $0x1,%edx
│ mov %edx,%eax
1.72 │ shl $0x8,%eax
0.83 │ and $0xfffffeff,%ecx
1.21 │ or %eax,%ecx
4.24 │ mov %ecx,0x2bfd90(%r13)
0.02 │ addl $0x1,0x2bfd94(%r13)
│ mov %r12d,%eax
2.28 │ shl $0x18,%eax
│ test %edx,%edx
0.02 │ je 20
8.72 │ and $0xfeffffff,%ecx
0.69 │ or %eax,%ecx
2.66 │ mov %ecx,(%r14)
│ jmpq 20
Another very high cost with CLANG is the skb->l4_hash setting, which again reads a 32-bit quantity in order to set a 2-bit bitfield.
skb_set_hash(skb, be32_to_cpu(desc->rss_hash), gq_rss_type(desc->flags_seq));
mov 0x80(%r13),%ecx // 40% of cpu cycles in gq_rx_napi_handler() !
mov $0xfffffcff,%esi
and %esi,%ecx
or %edx,%ecx
mov %cx,0x80(%r13)
Again, this high cost (reading 32 bits) is because of a prior write of 16 bits at 0x80(%r13):
mov 0x80(%r13),%eax
and $0xffffff9f,%eax
or $0x40,%eax
mov %ax,0x80(%r13) // writing 16bits !
This prior sequence is about skb->ip_summed being set: skb->ip_summed = CHECKSUM_COMPLETE;
Here is what gcc emits (no mixing of byte/word accesses):
765: 0f b6 83 80 00 00 00 movzbl 0x80(%rbx),%eax // byte load
76c: 83 e0 9f and $0xffffff9f,%eax
76f: 83 c8 40 or $0x40,%eax
772: 88 83 80 00 00 00 mov %al,0x80(%rbx) // byte write
...
79e: 0f b6 83 81 00 00 00 movzbl 0x81(%rbx),%eax // byte load
7a5: 41 8b 56 08 mov 0x8(%r14),%edx
7a9: 40 0f 95 c6 setne %sil
7ad: 83 e0 fc and $0xfffffffc,%eax
7b0: 0f ca bswap %edx
7b2: 09 f0 or %esi,%eax
7b4: 89 93 94 00 00 00 mov %edx,0x94(%rbx)
7ba: 88 83 81 00 00 00 mov %al,0x81(%rbx) // byte write
A replication program is attached. Some results:
$ perf stat -r10 -e cycles clang-bitfield:
8,709,304,936 cycles:u ( +- 0.34% )
2.8238 +- 0.0131 seconds time elapsed ( +- 0.46% )
$ perf stat -r10 -e cycles clang-bitfield-word:
8,059,274,548 cycles:u ( +- 0.13% )
2.6271 +- 0.0118 seconds time elapsed ( +- 0.45% )
$ perf stat -r10 -e cycles gcc-bitfield:
7,841,119,839 cycles:u ( +- 0.25% )
2.5847 +- 0.0151 seconds time elapsed ( +- 0.58% )