hi Raffi,
sorry, I was caught up with a few deadlines and did not reply sooner.
Typically, mcx won't run out of GPU global memory even on consumer-grade cards, and it is even less likely on an A100. Oftentimes, an out-of-memory error is instead caused by requesting more shared memory or constant memory than the card provides.
NVIDIA GPUs' shared memory size is largely fixed for a given architecture, regardless of the model. On the A100 it is at most 100 KB per SM (older models range between 42 KB and 64 KB). mcx uses shared memory to temporarily buffer each photon's data (such as partial path lengths per medium label) until the photon is detected. If your volume has a huge number of media labels, you may run out of shared memory. mcx prints the allocated shared memory bytes in its log; check whether that number exceeds the capacity of your card.
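As a rough back-of-envelope (the per-photon record layout below is an assumption for illustration, not mcx's actual internal format), you can estimate how many in-flight photon records fit in one SM's shared memory:

```python
# Hypothetical estimate: assume each buffered photon record stores one
# 4-byte float of partial path length per medium label, plus a few extra
# fields (detector id, weight, exit position/direction, etc.).
n_media = 500          # number of tissue labels in the volume (example)
extra_fields = 8       # hypothetical count of extra per-photon fields
bytes_per_photon = 4 * (n_media + extra_fields)

shared_per_sm = 100 * 1024   # A100: up to ~100 KB shared memory per SM
photons_per_sm = shared_per_sm // bytes_per_photon
print(bytes_per_photon, photons_per_sm)
```

The point of the sketch is only that the budget shrinks linearly with the number of media labels; the authoritative number is the allocated-bytes line that mcx prints in its log.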
Constant memory is used for storing the optical properties (cfg.prop) and the source and detector positions (cfg.srcpos, cfg.detpos). Given the typical NVIDIA GPU constant memory size (64 KB), the current mcx can support up to 4000 entries combined: media types + source count + detector count. If this combined length exceeds 4000, you will also get an out-of-memory error.
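A quick pre-flight check of this budget might look like the following (the variable names are illustrative, not part of mcx's API; the 4000-entry ceiling is the one stated above):

```python
# Hypothetical sanity check before launching a simulation: the combined
# number of media types, sources and detectors must stay within the
# ~4000-entry constant-memory budget described above.
n_mediatype = 3000   # entries in cfg.prop (example)
n_src = 1            # rows in cfg.srcpos (example)
n_det = 1200         # rows in cfg.detpos (example)

total = n_mediatype + n_src + n_det
if total > 4000:
    print(f"{total} entries will exceed constant memory; "
          "reduce media labels or detectors")
```

With the example numbers above, total is 4201, so this configuration would trigger the out-of-memory error.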
If you see a segfault, it is usually thrown by the host code or the driver, and is likely unrelated to CUDA or the GPU compute itself.
If you have other CUDA applications (like deviceQuery), check whether they run.
If you can't figure it out, please send me a small sample code (a reproducer), with data if any, so I can reproduce the issue. If the data is MBs in size, put it on a network drive instead of attaching it.
Qianqian