ctest - GPU related tests fail


William Torre

Jun 3, 2021, 6:25:54 AM
to hoomd-users
Hello,

I am trying to install the latest version of HOOMD-blue on a cluster (in my environment) to perform simulations using GPUs. Using CUDA 11.3, I was able to configure and compile it, but now, before installing the software, I am running "ctest" and many tests fail. The issue seems to be related to the GPU somehow, but we could not find the actual reason. The error message I get is identical for each failed test:

"unexpected test termination: No supported GPUs are present on this system.
Failed to get GPU device count: unknown error"

In trying to fix the issue I've already added nvcc to the PATH in my environment (in .bashrc) but nothing changed. The GPU used is an NVIDIA Tesla V100-PCIE-16GB. Thank you in advance for your patience!

William




Joshua Anderson

Jun 3, 2021, 11:55:42 AM
to hoomd...@googlegroups.com
William,

This error indicates that the CUDA driver reported an error when requesting the device count. This could be, e.g., because the node you are on does not have a GPU or because your job did not request GPU resources.
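
A quick way to see the same check outside of HOOMD is to query the CUDA runtime for the device count directly. Here is a minimal sketch, assuming `libcudart.so` from your CUDA 11.3 toolkit can be found by the dynamic loader (adjust the library name or path for your install):
```
import ctypes

# load the CUDA runtime library (assumption: libcudart.so is on the
# loader path; pass a full path to CDLL otherwise)
cudart = ctypes.CDLL("libcudart.so")

count = ctypes.c_int(0)
# cudaGetDeviceCount returns a cudaError_t status code; 0 means success
status = cudart.cudaGetDeviceCount(ctypes.byref(count))
print(f"cudaGetDeviceCount status: {status}, devices: {count.value}")
```
If this also reports an error, the problem is in the driver/runtime setup on that node rather than anything HOOMD-specific.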

What is the output of `nvidia-smi`?

Are you using HOOMD v3? If so, can you paste the full error output from pytest?

To get complete error output from ctest, you need to run it with the option `--output-on-failure`.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan

William Torre

Jun 4, 2021, 5:48:56 AM
to hoomd-users
Hello Joshua,

Thank you for your reply. I am running the HOOMD v3 tests from a node with a GPU, and the output from nvidia-smi is:

| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Tesla V1...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   60C    P0    49W / 250W |      0MiB / 16160MiB |      4%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

And the output from "ctest --output-on-failure" is attached. Thank you in advance for your time.

William
(attachment: error.txt)

Joshua Anderson

Jun 4, 2021, 6:35:41 AM
to hoomd...@googlegroups.com
William,

Can you run other CUDA applications, or is this a HOOMD-specific issue?

Can you post the JSON file that this script writes? It includes more details that might help troubleshoot this issue:
```
import hoomd

# use the CPU device so this runs even if GPU initialization fails
sim = hoomd.Simulation(device=hoomd.device.CPU())
# write build configuration and system details useful for troubleshooting
sim.write_debug_data("debug.json")
```

Otherwise, if I had to guess, I'd say it's possible that your user doesn't have access to the NVIDIA devices, or possibly that the environment variable CUDA_VISIBLE_DEVICES is set to an invalid value. Check with:
```
$ ls -l /dev/nv*
$ echo $CUDA_VISIBLE_DEVICES
```

------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan


William Torre

Jun 4, 2021, 9:38:20 AM
to hoomd-users
I am not sure yet if it is a HOOMD-specific problem. In any case, when I try to run the script I get the following error:

Traceback (most recent call last):
  File "create_json.py", line 1, in <module>
    import hoomd
  File "./software/hoomd-blue/build/hoomd/__init__.py", line 13, in <module>
    from hoomd import version
  File "./software/hoomd-blue/build/hoomd/version.py", line 36, in <module>
    from hoomd import _hoomd
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by ./software/hoomd-blue/build/hoomd/_hoomd.cpython-36m-x86_64-linux-gnu.so)

Maybe I should mention that in order to successfully compile HOOMD I had to add the flags "-ldl -lpthread -lutil" for gcc.

Regarding CUDA_VISIBLE_DEVICES, it returns "0", so I suppose it is fine.

Joshua Anderson

Jun 4, 2021, 10:07:13 AM
to hoomd...@googlegroups.com
William,

`version `GLIBC_2.27' not found` is a very strange error to get if you compiled HOOMD on that system. It indicates that the binary you compiled was linked against a libc that is newer than is available on the system where you are trying to run hoomd. That usually only happens when the host you compile on has a different OS distribution than where you run.
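
If it helps narrow this down, here is a minimal sketch (plain ctypes, nothing HOOMD-specific) that prints the glibc version available on the node where you run. You can compare it against the GLIBC_2.27 symbol version that the traceback says `_hoomd.cpython-36m-x86_64-linux-gnu.so` requires; `objdump -T` on that .so also lists the required GLIBC versions.
```
import ctypes

# ask the system libc for its version string; gnu_get_libc_version is a
# standard glibc function that returns a C string such as "2.17"
libc = ctypes.CDLL("libc.so.6")
libc.gnu_get_libc_version.restype = ctypes.c_char_p
print(libc.gnu_get_libc_version().decode())
```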

Did you build HOOMD on the cluster head node? Does the GPU compute node have a different configuration than the head node? You could try building HOOMD on the compute node to resolve this.

Also, I'm not aware of any systems needing manual library specifications like `-ldl -lpthread -lutil`. What Linux distribution and compiler (including versions for each) are you using? CMake should set all necessary library and include paths.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan


William Torre

Jun 4, 2021, 11:09:04 AM
to hoomd-users
I am using CentOS Linux 7 (Core) and gcc 10.2.0. The latter is installed in my environment via GUIX. At the moment I am trying to build HOOMD in a normal session with ENABLE_GPU=off, just to check whether the `GLIBC_2.27' not found` error is still there when I try to import hoomd.

Joshua Anderson

Jun 4, 2021, 12:40:36 PM
to hoomd...@googlegroups.com
William,

I'm not familiar with GUIX, but it is entirely possible that the gcc it provides generates code that uses a newer glibc than is on the system. Most clusters provide modern compilers as modules that you can load, which the cluster administrators built to be compatible with the OS.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan


William Torre

Jun 22, 2021, 5:27:35 AM
to hoomd-users
Hello Joshua,

Thank you for your previous assistance. After trying several times to build HOOMD without GPU support, I still get the same "GLIBC_2.X" error. So, to save some time, I would like to use it via Singularity (which is already installed on the HPC cluster that I'm using). However, I am now wondering how I can install an external plugin without having HOOMD installed from source. Is this possible?

Kind regards,
William

Joshua Anderson

Jun 28, 2021, 12:08:03 PM
to hoomd...@googlegroups.com
William,

Singularity images are static files that contain the entire OS and software stack. To add a plugin, you would need to build your own Singularity image. I think you would be more likely to have success building it using the compiler toolchain provided on your cluster, not the one provided by GUIX. Contact your cluster administrator for assistance.
------
Joshua A. Anderson, Ph.D.
Research Area Specialist, Chemical Engineering, University of Michigan

