Dear Dr Fang & community
Perhaps this is not an MCX-specific question, so apologies if not...
no worries, mcx/mcxcl share the same mailing list
I am running MCXLABCL on Fedora 33 with an NVIDIA GeForce RTX 2080 SUPER.
When my mcxlabcl calls crash, they seem to hold on to GPU memory. This makes debugging hard, unless I restart MATLAB each time or call from the command line (no GUI).
I have made similar observations recently when debugging mcxlab in MATLAB R2020 + Ubuntu 20.04 with an RTX 2060 (driver 460.73.01). However, on my other machines running Ubuntu 16.04/18.04 with older driver versions, I rarely saw this - not sure if this is a new issue caused by later GPU drivers.
as a test, I added a cudaDeviceReset() call in the error-handling
function, see
https://github.com/fangq/mcx/blob/master/src/mcx_core.cu#L1900
but it does not seem to solve the issue. I still had to restart matlab.
For OpenCL/mcxcl, I am still not able to locate a function to
force resetting a GPU.
I've tried resetting the GPU from MATLAB (g=gpuDevice(1); reset(g);) but it does nothing.
I suppose it does the same as cudaDeviceReset; I am not entirely sure whether it has any impact on opencl/mcxcl handling.
I am more curious about what caused the crash. If you can share a script so I can reproduce it, I would prefer to fix the crash itself rather than the device reset after a crash.
please let me know.
Qianqian
Any insights appreciated,
Thanks and regards,
Alex Antroqbus
Thanks Dr Fang. I will try to put together an MWE that causes the crash and send it over.
On a different point: I watched the online seminar you gave last Monday and found it very informative about the general MCX landscape - what packages are out there and how they relate.
hi Alex
one of my slides contains a diagram (attached here), showing the
overall tools we are providing/developing - basically we maintain
3 code bases - mcx (cuda), mcxcl (opencl), and mmc (sse/opencl/cuda),
each tool contains two interfaces - a standalone executable, and a
matlab/octave mex file. So, that makes 6 combinations (mcx, mcxcl,
mmc, mcxlab, mcxlabcl and mmclab). On top of that, we have a
unified GUI - mcxstudio to help create JSON input files for each
tool. For more info, please check out our wiki
Each of the codes contains numerous options/flags to support different types of simulations and settings - a list of the options can be found here
http://mcx.space/#optionhelp (then click on "7-Options")
If you install our all-in-one MCXStudio package/installer (see
https://twitter.com/FangQ/status/1279935567095087106), it will
install all 6 components, plus other small utilities (mcx/utils,
mmc/matlab ...)
Our packages are also available on dockerhub for cloud based
processing/automation
I was wondering if you would be willing to share your slides from your talk? Even if a reduced / abridged set, removing anything you would rather not share.
sent it to you offline. please check your email.
Qianqian
Morning Dr Fang
Thank you very much for the comprehensive and informative response. Very helpful.
Since I am running Fedora (in fact, I'm using the NeuroFedora image), I thought I read somewhere that using OpenCL mcx (mcxlabcl) was recommended for this OS, even with NVIDIA cards (maybe because CUDA on Linux used to be a pain?). Anyway, I cannot find any reference to this now (can't recall where I read it) - but I CAN say that the plot thickens....
Consider two runs of MCXLAB/MCXLABCL (looped over many times, as in the script I posted before). I find the following:
In OCTAVE (using the mcxlabcl version distributed with NeuroFedora):
flux=mcxlabcl(cfg) <- this does NOT result in memory saturation
[flux,detp]=mcxlabcl(cfg) <- this DOES result in memory saturation
hi Alex
the mcxlabcl that I uploaded in NeuroFedora was an old release - v0.9.5 (v2019.10).
between then and now, the biggest change in terms of OpenCL use
is the duplication of read-only GPU buffers for each GPU,
otherwise, multi-GPU can not be used simultaneously on NVIDIA
cards (but does work on AMD/Intel devices)
https://github.com/fangq/mcxcl/commit/c1e3ebbe995724436a3f627d4826582d2a9a4f5c
aside from that change, I see that the memory leak caused by the detected photon buffer Pdet already existed in v0.9.5
because of that, I expect there should be some level of memory
accumulation - not necessarily saturation - on this version.
In MATLAB (using the latest stable release, Furious Fermion 2020, precompiled...):
flux=mcxlabcl(cfg) <- this DOES result in memory saturation
[flux,detp]=mcxlabcl(cfg) <- this also DOES result in memory saturation
.. but ....
flux=mcxlab(cfg) <- this DOES NOT result in memory saturation
[flux,detp]=mcxlab(cfg) <- this also DOES NOT result in memory saturation
this aligns with the latest observations.
for now, please use mcxlab to take advantage of the higher speed
and robust memory deallocation. I will find out if the opencl
related memory leak is related to the buffer duplication change
that I mentioned above.
Qianqian
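
To put numbers on the "memory saturation" reported in this thread, GPU memory use can be sampled from a second terminal while the mcxlab/mcxlabcl loop runs. The following is only a minimal sketch, not part of MCX: it assumes the nvidia-smi tool that ships with the NVIDIA driver is on the PATH, and the helper names (parse_memory_used, query_memory_used, is_growing, log_memory) are hypothetical, made up for this illustration.

```python
import subprocess
import time

# Assumption: nvidia-smi (installed with the NVIDIA driver) is on PATH.
# This query prints one line per GPU with the used memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(text):
    """Turn the CSV output (one number per GPU, in MiB) into a list of ints."""
    return [int(line) for line in text.splitlines() if line.strip()]


def query_memory_used():
    """Return the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)


def is_growing(samples, min_increase_mib=1):
    """True if usage never dropped and rose by at least min_increase_mib overall."""
    if len(samples) < 2:
        return False
    never_dropped = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_dropped and samples[-1] - samples[0] >= min_increase_mib


def log_memory(n_samples=30, interval_s=1.0, gpu_index=0):
    """Hypothetical helper: sample one GPU repeatedly and report the trend."""
    samples = []
    for _ in range(n_samples):
        samples.append(query_memory_used()[gpu_index])
        print(f"GPU {gpu_index}: {samples[-1]} MiB used")
        time.sleep(interval_s)
    return is_growing(samples)
```

Calling log_memory() while the loop of [flux,detp]=mcxlabcl(cfg) calls is running, and again with mcxlab, would show whether used memory climbs steadily in one case and stays flat in the other. A steady climb is only a hint of a leak; other processes on the same GPU would also show up in these numbers.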