
Bug#1007832: vulkan-tools: vulkaninfo,vkcube seg-faults on head-less cuda/gpu machine


Alois Schlögl

Mar 17, 2022, 9:50:03 AM


Package: vulkan-tools
Version: 1.2.162.0+dfsg1-1
Severity: normal
X-Debbugs-Cc: alois.s...@ist.ac.at

Dear Maintainer,

*** Reporter, please consider answering these questions, where
appropriate ***

   * What led up to the situation?

     Trying to run wine-staging on a GPU/CUDA compute node failed
because Vulkan was not correctly configured.
     See also https://bugs.winehq.org/show_bug.cgi?id=52647
     In the course of the investigation, I noticed that vulkaninfo
always seg-faults on headless CUDA/GPU machines under Debian 11.


   * What exactly did you do (or not do) that was effective (or
     ineffective)?

     I've tried installing and uninstalling various graphics/Vulkan drivers.
     Testing on a machine without GPUs did not show this issue.

     Here is the output when running the debugger on vulkaninfo:

    $ gdb vulkaninfo
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from vulkaninfo...
    (No debugging symbols found in vulkaninfo)
    (gdb) r
    Starting program: /usr/bin/vulkaninfo
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Program received signal SIGSEGV, Segmentation fault.
    0x000015554ceb23fc in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
    (gdb) Quit
    quit)

     The output of these commands is attached

        $ vulkaninfo
        $ vkcube
        $ vkcubepp

        $ lspci|grep VGA
        $ nvidia-smi

        $ ls -altr  /usr/share/vulkan/icd.d/
        $ cat  /usr/share/vulkan/icd.d/*

        $ dpkg -l|grep "icd\|xorg\|nvid\|cuda\|mesa\|vulkan"

        $ glxinfo

   * What was the outcome of this action?

     vulkaninfo and vkcube always ended in a seg-fault when running on
CUDA/GPU machines.
     Doing this on machines without NVIDIA GPUs does not show this issue.


   * What outcome did you expect instead?
     I understand that CUDA is proprietary technology and therefore
this platform is difficult to support.

     I'd expect that vulkaninfo does not seg-fault, and finishes
gracefully with a more or less meaningful error message.
     Ideally, it should indicate what is needed to fix this issue.

     After all, it should be possible to run vulkaninfo on a headless
GPU/CUDA machine, which should also make it possible to run
     wine-staging on that machine.


*** End of the template - remove these template lines ***


-- System Information:
Debian Release: 11.2
  APT prefers stable
  APT policy: (990, 'stable'), (500, 'stable-security')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-12-amd64 (SMP w/24 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE,
TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE
not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages vulkan-tools depends on:
ii  libc6               2.31-13+deb11u2
ii  libgcc-s1           10.2.1-6
ii  libstdc++6          10.2.1-6
ii  libvulkan1          1.2.162.0-1
ii  libwayland-client0  1.18.0-2~exp1.1
ii  libx11-6            2:1.7.2-1
ii  libxcb1             1.14-3

vulkan-tools recommends no packages.

vulkan-tools suggests no packages.

-- no debconf information

Alois Schlögl

Mar 28, 2022, 4:20:04 PM


When checking for possible causes, I noticed that the framebuffer is now
(in Debian 11) associated with the first GPU. This seems wrong, as this is
a headless machine, and the NVIDIA GPUs are used only as
computational accelerators for CUDA workloads, not for visualization.

This is shown with the command
   lshw -C display

which has this entry
   logical name: /dev/fb0

The full log is attached.

Under Debian 10, this entry is not shown, and vulkaninfo works.

Is there an (easy) way to disable /dev/fb0?
lshw-display.log

Alois Schlögl

Mar 29, 2022, 12:50:03 PM


Further investigation shows that part of the issue lies in the package
"mesa-vulkan-drivers", and that it also occurs on machines without
NVIDIA GPUs.


When both the nvidia drivers and mesa-vulkan-drivers are uninstalled,
vulkan-tools correctly reports an error that no Vulkan device is found.
When mesa-vulkan-drivers is installed and the nvidia driver is in
place, "gdb vulkaninfo" results in this backtrace:


Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
ERROR: [Loader Message] Code 0 : /usr/lib/i386-linux-gnu/libvulkan_intel.so: wrong ELF class: ELFCLASS32
ERROR: [Loader Message] Code 0 : /usr/lib/i386-linux-gnu/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
ERROR: [Loader Message] Code 0 : /usr/lib/i386-linux-gnu/libvulkan_lvp.so: wrong ELF class: ELFCLASS32

Program received signal SIGSEGV, Segmentation fault.
0x000015554aa4d3fc in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
(gdb) bt
#0  0x000015554aa4d3fc in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
#1  0x000015554aa4ea65 in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
#2  0x0000155554bef7e7 in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan.so.1
#3  0x0000155554befc22 in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan.so.1
#4  0x000015554a96fde4 in ?? () from /usr/lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
#5  0x0000155554bef2a3 in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan.so.1
#6  0x0000155554bf1e05 in vkEnumeratePhysicalDevices () from /usr/lib/x86_64-linux-gnu/libvulkan.so.1
#7  0x00005555555b5639 in ?? ()
#8  0x00005555555638b6 in ?? ()
#9  0x0000155554fead0a in __libc_start_main (main=0x555555563670, argc=1, argv=0x7fffffffe458, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe448) at ../csu/libc-start.c:308
#10 0x000055555556580a in ?? ()
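
The "wrong ELF class: ELFCLASS32" messages in the log above are what a
64-bit process reports when it is asked to load a 32-bit library (i386 is
configured as a foreign architecture on this system); the crash itself is
in the x86_64 library. As a minimal sketch, the class of a given library can
be checked by reading the EI_CLASS byte of its ELF header. The helper name
elf_class is made up for illustration, and the function does not validate
the ELF magic bytes.

```shell
# Sketch: print the ELF class of a file by reading byte 4 (EI_CLASS) of the
# ELF header: 1 = 32-bit (ELFCLASS32), 2 = 64-bit (ELFCLASS64).
# elf_class is a hypothetical helper name, not part of any package.
elf_class() {
    # -j4 skips the 4 magic bytes, -N1 reads one byte, -tu1 prints it as
    # an unsigned decimal number; tr strips the padding od adds.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case $class in
        1) echo ELFCLASS32 ;;
        2) echo ELFCLASS64 ;;
        *) echo unknown ;;
    esac
}

# Example:
#   elf_class /usr/lib/i386-linux-gnu/libvulkan_intel.so
#   elf_class /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
```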



When mesa-vulkan-drivers is installed and the nvidia drivers are
disabled (e.g. with "rmmod nvidia-drm nvidia-settings"), or on machines
without NVIDIA GPUs,
vulkan-tools shows a lot of properties, but also fails with a
segmentation fault.

Running "gdb vulkaninfo", I get this backtrace:

Thread 1 "vulkaninfo" received signal SIGSEGV, Segmentation fault.
0x00007fffcd2f52bc in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so
(gdb) bt
#0  0x00007fffcd2f52bc in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so
#1  0x00007fffcd2f57f1 in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so
#2  0x00005555555a7140 in ?? ()
#3  0x00005555555a123e in ?? ()
#4  0x0000555555564565 in ?? ()
#5  0x00007fffd6fa2d0a in __libc_start_main (main=0x555555563670, argc=1, argv=0x7fffffffda98, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffda88) at ../csu/libc-start.c:308
#6  0x000055555556580a in ?? ()


Though the behavior is slightly different, both libraries,
libvulkan_intel.so and libvulkan_lvp.so, are part of the
mesa-vulkan-drivers package.

So this indicates that this is a bug in package mesa-vulkan-drivers.
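
One way to narrow down which ICD triggers the crash is to run the tool once
per ICD manifest, restricting the loader to a single driver each time. The
following is a minimal sketch, assuming the installed Vulkan loader honours
the VK_ICD_FILENAMES environment variable (not verified on this system). The
helper name probe_icds is made up for illustration; the directory in the
example is the one listed earlier in this report.

```shell
# Sketch: run a command once per Vulkan ICD manifest and report its exit
# status, so that a crash can be attributed to a single driver.
# probe_icds is a hypothetical helper name.
# Assumption: the loader restricts itself to the manifest named in
# VK_ICD_FILENAMES.
probe_icds() {
    icd_dir=$1
    shift
    for manifest in "$icd_dir"/*.json; do
        [ -e "$manifest" ] || continue      # no manifests in the directory
        status=0
        VK_ICD_FILENAMES=$manifest "$@" >/dev/null 2>&1 || status=$?
        # A status above 128 usually means the process was killed by a
        # signal (139 = 128 + 11, i.e. SIGSEGV).
        printf '%s: exit status %d\n' "$manifest" "$status"
    done
}

# Example:
#   probe_icds /usr/share/vulkan/icd.d vulkaninfo
```

A manifest whose run ends with status 139 would point at the driver it names.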

Alois Schlögl

Sep 5, 2022, 8:30:05 AM

Control: retitle -1 vulkaninfo,vkcube seg-faults on remote access.

The issue is unrelated to cuda, but seems to be related to any remote
access. The issue can be easily reproduced on a local machine, when
doing this:

ssh -X localhost
$ vkcube
MESA-INTEL: warning: Haswell Vulkan support is incomplete
Segmentation fault
$ vkcubepp
MESA-INTEL: warning: Haswell Vulkan support is incomplete
Segmentation fault

vulkaninfo shows a lot of information, but also seg-faults at the end.

Alois Schlögl

Oct 4, 2022, 6:40:04 PM


Control: reassign 1007832 mesa 20.3.5-1


Further testing shows that this bug is related to mesa 20.3.5-1 (on
bullseye). Therefore, this bug should be assigned to the mesa package.

The issue goes away when a more recent version of mesa is compiled from
source and the resulting libGL.so is preloaded on the remote machine.
Then vulkaninfo and the other vulkan-tools programs no longer crash. This
suggests that the problem is caused by mesa 20.3.5-1; most likely,
libGL.so from the libgl1 package is the culprit.
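
For reference, preloading a self-built library can be sketched as follows.
This is only an illustration: the helper name run_with_libgl is made up, and
the library path in the example is a placeholder, since the install location
of the self-built mesa is not stated here.

```shell
# Sketch: run a command with a given libGL.so preloaded via LD_PRELOAD.
# run_with_libgl is a hypothetical helper name; the path passed to it must
# point at the libGL.so produced by the self-built mesa.
run_with_libgl() {
    lib=$1
    shift
    if [ ! -f "$lib" ]; then
        echo "run_with_libgl: $lib not found" >&2
        return 1
    fi
    # The assignment applies to this one command only.
    LD_PRELOAD=$lib "$@"
}

# Example (the path is a placeholder):
#   run_with_libgl "$HOME/mesa-install/lib/libGL.so" vulkaninfo
```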

The test was successful when using
   mesa/22.1.5 (and later versions including 22.2.0)
   libdrm/2.4.112 and llvm/13.0.1
and configuring mesa with the following options
    -Degl=disabled \
    -Dgles1=enabled \
    -Dgles2=enabled \
    -Dshared-glapi=enabled \
    -Dglx=xlib \
    -Dgallium-drivers=swrast \
    -Dvulkan-drivers=intel,swrast \
    -Dplatforms=x11 \
    -Ddri3=enabled \
    -Ddri-drivers="" \
    -Dswr-arches=avx,avx2,skx \
    -Dcpp_rtti=false \

The test is also successful when compiling mesa/22.2.0 with the
following configuration
    -Dcpp_rtti=false \
    -Ddri3=enabled \
    -Ddri-drivers="" \
    -Degl=disabled \
    -Dgallium-drivers=swrast,virgl,svga,d3d12,zink,iris,crocus,i915 \
    -Dgallium-xa=enabled \
    -Dgles1=enabled \
    -Dgles2=enabled \
    -Dglx=xlib \
    -Dosmesa=true \
    -Dplatforms=x11 \
    -Dshared-glapi=enabled \
    -Dvulkan-drivers=intel,swrast,virtio-experimental,imagination-experimental \

Maybe this helps to bisect the issue.