Random CUDA errors and how to handle them

Kirill Moskovtsev

Feb 13, 2018, 11:43:03 AM
to hoomd-users
I'm using HOOMD, and 99% of the time (or more) it works perfectly. But once in a while I get a random CUDA error that is not reproducible. For instance, hoomd.context.initialize sometimes fails with a CUDA error. More recently I got the following error:

RuntimeError: CUDA Error
**ERROR**: an illegal memory access was encountered before /hoomd/GPUArray.h:672
This one is apparently a memory deallocation error. It happens in the middle of a run.

These errors seem to appear randomly (and rarely) in otherwise working code.
So my questions are:

1) Do these errors necessarily mean an error in the code (or a compilation issue)?
2) Is there a way to catch these errors from the Python script and continue running?

For the second part, I tried catching the exception with try/except and then calling hoomd.context.initialize again, but it usually fails again.

I'm using v2.1.9, and I did create my own plugins: forces and integrators. It is running on Tesla K80 GPUs in our cluster.

Jens Glaser

Feb 13, 2018, 11:46:07 AM
to hoomd...@googlegroups.com
Hi Kirill,

If it fails in context.initialize() this would hint at a GPU problem, not a problem with the code.

RuntimeError: CUDA Error
**ERROR**: an illegal memory access was encountered before /hoomd/GPUArray.h:672
This one is apparently a memory deallocation error. It happens in the middle of a run.


We will need to track this down further. Can you run with --gpu_error_checking and give us the exact location of the failure?
The output of a run with cuda-memcheck would also be useful.

- Jens

--
You received this message because you are subscribed to the Google Groups "hoomd-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hoomd-users...@googlegroups.com.
To post to this group, send email to hoomd...@googlegroups.com.
Visit this group at https://groups.google.com/group/hoomd-users.
For more options, visit https://groups.google.com/d/optout.

Kirill Moskovtsev

Feb 13, 2018, 12:14:36 PM
to hoomd...@googlegroups.com
Jens,

Thanks for the quick reply! I will try to reproduce that error with gpu_error_checking. It will take a while, though, maybe a couple of days.
Should I use any options with cuda-memcheck?

Thanks,
Kirill

Jens Glaser

Feb 13, 2018, 12:18:30 PM
to hoomd...@googlegroups.com
For now, just
$ cuda-memcheck python my_script.py

J

Kirill Moskovtsev

Feb 16, 2018, 4:31:56 PM
to hoomd-users
Jens,

I can't reproduce those errors anymore. But I think I got the point: these kinds of errors may be caused by hardware problems, and it seems the only way to handle them is to restart the job. It's not a big deal to me; I just wanted to know whether there was a cleverer way to handle it without restarting the entire job.

Thanks,
Kirill
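
For anyone finding this thread later, here is a minimal sketch of the "restart the job" approach as a small wrapper script. It is not part of HOOMD. The function name run_with_restarts and its parameters are made up for illustration, and my_script.py is just the placeholder name used above. The idea is that each attempt runs in a brand-new Python process, which is probably why this helps where try/except inside the same process does not: after an error such as an illegal memory access, the CUDA context of that process is usually left unusable. It assumes the simulation script exits with a nonzero code on failure and can either start over or resume from its own saved state.

```python
import subprocess
import sys
import time


def run_with_restarts(command, max_attempts=3, delay_seconds=0.0):
    """Run `command` in a fresh process, rerunning it if it exits with an error.

    Hypothetical helper for illustration. Returns the exit code of the
    last attempt (0 on success).
    """
    returncode = 1
    for attempt in range(1, max_attempts + 1):
        # Every attempt is a new process, so nothing from a failed
        # attempt (including its GPU state) is carried over.
        returncode = subprocess.run(command).returncode
        if returncode == 0:
            break
        print(f"attempt {attempt} failed with exit code {returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            # Optional pause before retrying, e.g. to let the GPU settle.
            time.sleep(delay_seconds)
    return returncode
```

A call such as run_with_restarts([sys.executable, "my_script.py"], max_attempts=3, delay_seconds=30) would then try the job up to three times. If the failures come from one bad GPU, retrying on the same node may keep failing, so resubmitting through the cluster scheduler could still be needed.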