[llvm-dev] ROCm module from LLVM AMDGPU backend


Frank Winter via llvm-dev

Apr 22, 2020, 4:57:48 PM4/22/20
to LLVM Dev
Hi,

I'm trying to launch a GPU kernel that was compiled with the LLVM
AMDGPU backend. So far I'm having no success with it, and I was
hoping someone tuned in here might have an idea.

It seems that TensorFlow does a similar thing, so I was reading the
TensorFlow code on GitHub, and I believe the following setup is
pretty close in the vital parts:


1) Compile an LLVM IR module (see below) with the AMDGPU backend to a
'module.o' file, using this triple/CPU:

   llvm::Triple TheTriple;
   TheTriple.setArch (llvm::Triple::ArchType::amdgcn);
   TheTriple.setVendor (llvm::Triple::VendorType::AMD);
   TheTriple.setOS (llvm::Triple::OSType::AMDHSA);

   std::string CPUStr("gfx906");

   LLVM IR passes that I use:

   TargetLibraryInfoWrapperPass
   TargetMachine->addPassesToEmitFile with CGFT_ObjectFile
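
   (Roughly, step 1 looks like this - a sketch rather than my exact code;
   the function name and error handling are placeholders, and on LLVM 10
   TargetRegistry.h still lives under llvm/Support:)

   #include "llvm/ADT/Triple.h"
   #include "llvm/Analysis/TargetLibraryInfo.h"
   #include "llvm/IR/LegacyPassManager.h"
   #include "llvm/IR/Module.h"
   #include "llvm/Support/CodeGen.h"
   #include "llvm/Support/FileSystem.h"
   #include "llvm/Support/TargetRegistry.h"
   #include "llvm/Support/TargetSelect.h"
   #include "llvm/Support/raw_ostream.h"
   #include "llvm/Target/TargetMachine.h"
   #include <memory>

   bool emitObject(llvm::Module &M) {
     llvm::InitializeAllTargetInfos();
     llvm::InitializeAllTargets();
     llvm::InitializeAllTargetMCs();
     llvm::InitializeAllAsmPrinters();

     const std::string TripleStr = "amdgcn-amd-amdhsa";
     std::string Err;
     const llvm::Target *T = llvm::TargetRegistry::lookupTarget(TripleStr, Err);
     if (!T) return false;

     // PIC, since the object is linked into a shared library afterwards
     llvm::TargetOptions Opts;
     std::unique_ptr<llvm::TargetMachine> TM(T->createTargetMachine(
         TripleStr, "gfx906", "", Opts, llvm::Reloc::PIC_));

     M.setTargetTriple(TripleStr);
     M.setDataLayout(TM->createDataLayout());

     std::error_code EC;
     llvm::raw_fd_ostream OS("module.o", EC, llvm::sys::fs::OF_None);
     if (EC) return false;

     llvm::legacy::PassManager PM;
     PM.add(new llvm::TargetLibraryInfoWrapperPass(llvm::Triple(TripleStr)));
     // addPassesToEmitFile returns true if the target cannot emit object files
     if (TM->addPassesToEmitFile(PM, OS, nullptr, llvm::CGFT_ObjectFile))
       return false;
     PM.run(M);
     return true;
   }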


2) Link it into a shared library with the LLVM linker, invoked via a
'system()' call:

   ld.lld -shared module.o -o module.so
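
   (In code this is just a shelled-out command; a sketch, assuming ld.lld
   is on the PATH and the file names are as above:)

   #include <cstdlib>

   int linkSharedObject() {
     // equivalent to typing the command above in a shell
     return std::system("ld.lld -shared module.o -o module.so");
   }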


3) Read the shared library back into a 'vector<uint8> shared'.
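
   (Sketch of the read-back step; 'readFile' is a placeholder name and
   uint8_t stands in for my uint8 element type:)

   #include <cstdint>
   #include <fstream>
   #include <iterator>
   #include <vector>

   std::vector<uint8_t> readFile(const char *path) {
     std::ifstream in(path, std::ios::binary);
     // slurp the whole shared object into memory
     return std::vector<uint8_t>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
   }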


4) Use HIP to load this module:

    hipModule_t module;
    ret = hipModuleLoadData( &module , shared.data() );

   (this returns hipSuccess)

5) Try to get a HIP function:

    hipFunction_t kernel;
    ret = hipModuleGetFunction(&kernel, module, "kernel" );

... and this fails with HIP error code 500!?
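
Steps 4 and 5 together, with the error strings printed (a sketch of what
I'm doing, variable names are mine; if I read the HIP headers right,
error 500 is hipErrorNotFound):

   #include <cstdint>
   #include <cstdio>
   #include <vector>
   #include <hip/hip_runtime.h>

   bool loadKernel(const std::vector<uint8_t> &shared, hipFunction_t &kernel) {
     hipModule_t module;
     hipError_t ret = hipModuleLoadData(&module, shared.data());
     if (ret != hipSuccess) {
       std::fprintf(stderr, "hipModuleLoadData: %s\n", hipGetErrorString(ret));
       return false;
     }
     // this is the call that fails for me with error 500
     ret = hipModuleGetFunction(&kernel, module, "kernel");
     if (ret != hipSuccess) {
       std::fprintf(stderr, "hipModuleGetFunction: %s\n", hipGetErrorString(ret));
       return false;
     }
     return true;
   }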


I believe the vital steps here concerning ROCm are similar
(identical?) to what's in TensorFlow, but I can't get it to work.

I have to admit that I did not build TensorFlow to check whether the
AMD GPU bits actually work. I read the comments, and some say it
comes with some performance overhead. Performance isn't the point at
the moment - I'm working on a proof of concept.

My test machine has an 'AMD gfx906' card installed.

Digging deeper: hipModule_t is a pointer to ihipModule_t, and printing
out its members after loading the module gives

ihip->fileName =
ihip->hash = 3943538976062281088
ihip->kernargs.size() = 0
ihip->executable.handle = 42041072

That doesn't tell me much, and I'm not sure what to do with the handle
for the executable.

Any ideas what could be tried next?


Frank


--------------------------------------------------------------

LLVM IR module

target datalayout =
"e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"

define void @kernel(i1 %arg0, i32 %arg1, i32 %arg2, i32 %arg3, i1 %arg4,
i32* %arg5, i1* %arg6, float* %arg7, float* %arg8, float* %arg9) {
entrypoint:
  %0 = sext i1 %arg4 to i32
  %1 = xor i32 -1, %0
  %2 = call i32 @llvm.amdgcn.workitem.id.x()
  %3 = icmp sge i32 %2, %arg1
  br i1 %3, label %L0, label %L1

L0:                                               ; preds = %entrypoint
  ret void

L1:                                               ; preds = %entrypoint
  %4 = trunc i32 %1 to i1
  br i1 %4, label %L3, label %L4

L2:                                               ; preds = %L6, %L5, %L4
  %5 = phi i32 [ %7, %L4 ], [ %8, %L5 ], [ %2, %L6 ]
  br i1 %arg0, label %L7, label %L8

L3:                                               ; preds = %L1
  br i1 %arg0, label %L5, label %L6

L4:                                               ; preds = %L1
  %6 = getelementptr i32, i32* %arg5, i32 %2
  %7 = load i32, i32* %6
  br label %L2

L5:                                               ; preds = %L3
  %8 = add nsw i32 %2, %arg2
  br label %L2

L6:                                               ; preds = %L3
  br label %L2

L7:                                               ; preds = %L2
  %9 = icmp sgt i32 %5, %arg3
  br i1 %9, label %L12, label %L13

L8:                                               ; preds = %L2
  %10 = getelementptr i1, i1* %arg6, i32 %5
  %11 = load i1, i1* %10
  %12 = sext i1 %11 to i32
  %13 = xor i32 -1, %12
  %14 = trunc i32 %13 to i1
  br i1 %14, label %L10, label %L11

L9:                                               ; preds = %L15, %L11
  %15 = add nsw i32 0, %5
  %16 = add nsw i32 0, %5
  %17 = getelementptr float, float* %arg8, i32 %16
  %18 = load float, float* %17
  %19 = add nsw i32 0, %5
  %20 = getelementptr float, float* %arg9, i32 %19
  %21 = load float, float* %20
  %22 = fmul float %18, %21
  %23 = getelementptr float, float* %arg7, i32 %15
  store float %22, float* %23
  ret void

L10:                                              ; preds = %L8
  ret void

L11:                                              ; preds = %L8
  br label %L9

L12:                                              ; preds = %L7
  ret void

L13:                                              ; preds = %L7
  %24 = icmp slt i32 %5, %arg2
  br i1 %24, label %L14, label %L15

L14:                                              ; preds = %L13
  ret void

L15:                                              ; preds = %L13
  br label %L9
}

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.amdgcn.workitem.id.x() #0

attributes #0 = { nounwind readnone speculatable }

------------------------------------------------------------------------------

The following is the assembly output the AMDGPU backend generates:


    .text
    .amdgcn_target "amdgcn-amd-amdhsa--gfx906"
    .globl    kernel
    .p2align    2
    .type    kernel,@function
kernel:
    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
    v_and_b32_e32 v4, 1, v4
    v_cmp_eq_u32_e64 s[4:5], 1, v4
    v_and_b32_e32 v0, 1, v0
    v_and_b32_e32 v4, 0x3ff, v15
    v_cmp_eq_u32_e32 vcc, 1, v0
    v_cmp_lt_i32_e64 s[6:7], v4, v1
    s_and_saveexec_b64 s[8:9], s[6:7]
    s_cbranch_execz BB0_16
BB0_1:
    s_and_saveexec_b64 s[6:7], s[4:5]
    s_xor_b64 s[6:7], exec, s[6:7]
    s_cbranch_execz BB0_3
BB0_2:
    v_lshlrev_b32_e32 v0, 2, v4
    v_add_co_u32_e64 v0, s[4:5], v5, v0
    v_addc_co_u32_e64 v1, s[4:5], 0, v6, s[4:5]
    flat_load_dword v0, v[0:1]
BB0_3:
    s_or_saveexec_b64 s[4:5], s[6:7]
    s_xor_b64 exec, exec, s[4:5]
    s_cbranch_execz BB0_7
BB0_4:
    s_xor_b64 s[6:7], vcc, -1
    s_waitcnt vmcnt(0) lgkmcnt(0)
    v_add_u32_e32 v0, v4, v2
    s_and_saveexec_b64 s[10:11], s[6:7]
    s_xor_b64 s[6:7], exec, s[10:11]
BB0_5:
    v_mov_b32_e32 v0, v4
BB0_6:
    s_or_b64 exec, exec, s[6:7]
BB0_7:
    s_or_b64 exec, exec, s[4:5]
    s_xor_b64 s[6:7], vcc, -1
    s_mov_b64 s[4:5], 0
    s_and_saveexec_b64 s[10:11], s[6:7]
    s_xor_b64 s[6:7], exec, s[10:11]
    s_cbranch_execz BB0_9
BB0_8:
    s_waitcnt vmcnt(0) lgkmcnt(0)
    v_ashrrev_i32_e32 v1, 31, v0
    v_add_co_u32_e32 v4, vcc, v7, v0
    v_addc_co_u32_e32 v5, vcc, v8, v1, vcc
    flat_load_ubyte v1, v[4:5]
    s_waitcnt vmcnt(0) lgkmcnt(0)
    v_and_b32_e32 v1, 1, v1
    v_cmp_eq_u32_e32 vcc, 1, v1
    s_and_b64 s[4:5], vcc, exec
BB0_9:
    s_or_saveexec_b64 s[6:7], s[6:7]
    s_xor_b64 exec, exec, s[6:7]
    s_cbranch_execz BB0_13
BB0_10:
    s_waitcnt vmcnt(0) lgkmcnt(0)
    v_cmp_le_i32_e32 vcc, v0, v3
    s_mov_b64 s[12:13], s[4:5]
    s_and_saveexec_b64 s[10:11], vcc
BB0_11:
    v_cmp_ge_i32_e32 vcc, v0, v2
    s_andn2_b64 s[12:13], s[4:5], exec
    s_and_b64 s[14:15], vcc, exec
    s_or_b64 s[12:13], s[12:13], s[14:15]
BB0_12:
    s_or_b64 exec, exec, s[10:11]
    s_andn2_b64 s[4:5], s[4:5], exec
    s_and_b64 s[10:11], s[12:13], exec
    s_or_b64 s[4:5], s[4:5], s[10:11]
BB0_13:
    s_or_b64 exec, exec, s[6:7]
    s_and_saveexec_b64 s[6:7], s[4:5]
    s_cbranch_execz BB0_15
BB0_14:
    s_waitcnt vmcnt(0) lgkmcnt(0)
    v_ashrrev_i32_e32 v1, 31, v0
    v_lshlrev_b64 v[0:1], 2, v[0:1]
    v_add_co_u32_e32 v2, vcc, v11, v0
    v_addc_co_u32_e32 v3, vcc, v12, v1, vcc
    flat_load_dword v4, v[2:3]
    v_add_co_u32_e32 v2, vcc, v13, v0
    v_addc_co_u32_e32 v3, vcc, v14, v1, vcc
    flat_load_dword v2, v[2:3]
    v_add_co_u32_e32 v0, vcc, v9, v0
    v_addc_co_u32_e32 v1, vcc, v10, v1, vcc
    s_waitcnt vmcnt(0) lgkmcnt(0)
    v_mul_f32_e32 v2, v4, v2
    flat_store_dword v[0:1], v2
BB0_15:
    s_or_b64 exec, exec, s[6:7]
BB0_16:
    s_or_b64 exec, exec, s[8:9]
    s_waitcnt vmcnt(0) lgkmcnt(0)
    s_setpc_b64 s[30:31]
.Lfunc_end0:
    .size    kernel, .Lfunc_end0-kernel

    .section    ".note.GNU-stack"
    .amdgpu_metadata
---
amdhsa.kernels:  []
amdhsa.version:
  - 1
  - 0
...

    .end_amdgpu_metadata

-----------------------------------------------------------------------


rocminfo output:


Agents 1 and 2 are the host's Intel CPUs; Agents 3 - 6 look like this:

*******
Agent 3
*******
  Name:                    gfx906
  Marketing Name:          Vega 20
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          4096(0x1000)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
  Chip ID:                 26273(0x66a1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1725
  BDFID:                   35328
  Internal Node ID:        2
  Compute Unit:            60
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      FALSE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33538048(0x1ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Acessible by all:        FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    33538048(0x1ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Acessible by all:        FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Acessible by all:        FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx906
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size:       32


_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Arsenault, Matthew via llvm-dev

Apr 22, 2020, 5:12:54 PM4/22/20
to LLVM Dev, Frank Winter

[AMD Official Use Only - Internal Distribution Only]


Your "@kernel" function isn't a kernel, it's the default C calling convention. You need to use the amdgpu_kernel calling convention

-Matt

From: llvm-dev <llvm-dev...@lists.llvm.org> on behalf of Frank Winter via llvm-dev <llvm...@lists.llvm.org>
Sent: Wednesday, April 22, 2020 1:57 PM
To: LLVM Dev <llvm...@lists.llvm.org>
Subject: [llvm-dev] ROCm module from LLVM AMDGPU backend
 
[CAUTION: External Email]