Out of memory error for high photon counts on A100 GPU


Raffi Hotter

Nov 27, 2023, 9:24:49 PM
to mcx-users
Hi Dr. Fang,

I'm running pmcx on a 150 x 150 x 1 grid. When I go to high photon counts (10^9), I often (but not always) get an out of memory error:

RuntimeError: PMCX terminated due to an exception!Error from thread (0): out of memory

My parameters are: 
nphoton: 1e+09
tstart: 0
tstep: 1e-08
tend: 1e-08
maxdetphoton: 1e+09
issrcfrom0: 1
replaydet: -1
issavedet: 1
issaveseed: 1

I am using a 40 GB A100, so the medium should definitely fit in memory. I got the same error on an NVIDIA V100 GPU. When I check with nvidia-smi, I see nearly 80% memory utilization. Sometimes it works again after I restart my Jupyter kernel, but most of the time it does not.

Any idea what might be wrong?

Raffi

Raffi Hotter

Dec 6, 2023, 12:58:00 PM
to mcx-users
I no longer get the out-of-memory error (though oddly, I don't believe anything was changed). I now get segmentation faults instead.

Qianqian Fang

Dec 6, 2023, 5:49:45 PM
to mcx-...@googlegroups.com, Raffi Hotter

Hi Raffi,

Sorry, I was caught up with a few deadlines and did not reply sooner.

Typically, mcx won't run out of GPU global memory even on consumer-grade cards, and it is even less likely on the A100.

More often, the out-of-memory error is caused by requesting more shared memory or constant memory than the GPU provides.

An NVIDIA GPU's shared memory size is largely fixed by its architecture, regardless of the specific model. For the A100, the maximum is 100 KB per SM (older models range from 42 KB to 64 KB). mcx needs shared memory to temporarily store a photon's detection data until the photon is detected. If you have a huge number of media labels, you may run out of shared memory. mcx prints the number of allocated shared memory bytes in the log, so you can check whether it exceeds the capacity of the card.

Constant memory is used for storing the optical properties (cfg.prop) and the source and detector positions (cfg.srcpos, cfg.detpos). Given the typical NVIDIA GPU constant memory size, the current mcx can support a combined total of up to 4000 media types, sources, and detectors. If this combined length exceeds 4000, you will also get an out-of-memory error.
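
As a rough illustration of the combined limit described above, the following sketch counts the rows in cfg.prop, cfg.srcpos, and cfg.detpos and compares the total against 4000. The helper names (count_rows, constant_entries, fits_constant_memory) are made up for this example and are not part of pmcx; the sketch also assumes that each row of these tables counts as one entry toward the limit.

```python
from numbers import Real

# Combined limit quoted in this thread (media types + sources + detectors).
MAX_CONST_ENTRIES = 4000


def count_rows(table):
    """Count rows in a table; a flat list of numbers counts as a single row."""
    if table is None or len(table) == 0:
        return 0
    if isinstance(table[0], Real):
        return 1
    return len(table)


def constant_entries(prop, srcpos, detpos=None):
    """Total entries across prop, srcpos, and detpos (assumed one per row)."""
    return count_rows(prop) + count_rows(srcpos) + count_rows(detpos)


def fits_constant_memory(prop, srcpos, detpos=None, limit=MAX_CONST_ENTRIES):
    """True if the combined count stays within the quoted limit."""
    return constant_entries(prop, srcpos, detpos) <= limit


if __name__ == "__main__":
    # Placeholder values for illustration only.
    prop = [[0, 0, 1, 1], [0.005, 1, 0.01, 1.37]]
    srcpos = [75, 0, 0]
    detpos = [[75, 10, 0, 1], [75, 20, 0, 1]]
    print(constant_entries(prop, srcpos, detpos))
    print(fits_constant_memory(prop, srcpos, detpos))
```

Running a check like this before launching a simulation only rules out the constant-memory case; it says nothing about shared memory usage.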


If you see a segfault, it is usually thrown by the host code or the driver, and it is likely unrelated to CUDA or GPU compute.


If you have other CUDA applications (such as deviceQuery), check whether they run.


If you can't figure it out, please send me a small sample script (a reproducer), with data if any, so I can reproduce the issue. If the data is megabytes in size, put it on a network drive.
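
A reproducer of the kind requested here might look like the sketch below. It reuses the grid size and the timing and detection settings from the original report; the optical properties, source position and direction, and detector position are placeholder assumptions, since the thread does not give them, and replaydet is left out. The function names build_cfg and run_reproducer are made up for this example; pmcx.run(cfg) is the call documented in the pmcx README.

```python
def build_cfg(nphoton=1e6):
    """Settings from the report; prop, srcpos, srcdir, detpos are placeholders."""
    return {
        "nphoton": int(nphoton),
        "tstart": 0,
        "tstep": 1e-8,
        "tend": 1e-8,
        "maxdetphoton": int(nphoton),  # the report set this equal to nphoton
        "issrcfrom0": 1,
        "issavedet": 1,
        "issaveseed": 1,
        # Assumed values below: not given in the thread.
        "prop": [[0, 0, 1, 1], [0.005, 1, 0.01, 1.37]],
        "srcpos": [75, 0, 0],
        "srcdir": [0, 1, 0],
        "detpos": [[75, 20, 0, 2]],
    }


def run_reproducer(nphoton=1e6):
    """Run the simulation; requires numpy, pmcx, and a CUDA-capable GPU."""
    import numpy as np
    import pmcx

    cfg = build_cfg(nphoton)
    cfg["prop"] = np.array(cfg["prop"])
    # 150 x 150 x 1 grid filled with medium label 1, as in the report.
    cfg["vol"] = np.ones([150, 150, 1], dtype="uint8")
    return pmcx.run(cfg)


if __name__ == "__main__":
    # Start small, then raise nphoton toward 1e9 to see where the failure appears.
    result = run_reproducer(1e6)
    print(result.keys())
```

Starting at a low photon count and increasing it stepwise helps show whether the failure depends on nphoton and maxdetphoton or occurs regardless.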


Qianqian
