segmentation violation error at low timestep, but GPU memory not at full utilisation


Thomas Rialan

Nov 18, 2024, 8:19:21 AM
to mcx-users
Hi Dr. Fang,

Great fan of MCX, thank you for your work. For the most part I've been able to get mcxlab to work well, but I've spent the last 3-4 days trying to get my new simulations to work with the correct number of timesteps without segmentation violation errors (code works fine for fewer timesteps).

My problem: the GPU has 40GB of memory, my simulation uses at most 20GB when tstep=5e-12 (as I expect from calculations provided), RAM is also plentiful, but I get a segmentation violation error. If tstep=1e-11, peak GPU usage is 10GB/40GB and there is no error. I expect MCX needs a few spare GB for various functions, but still GPU memory doesn't seem to be the issue.

What do you think might be causing the segmentation violation?

I am currently running with the latest binaries (18 November 2024) from this page: https://mcx.space/nightly/linux64/.

I provide the full code, output log, error log below:

My script:

disp('Starting MATLAB script execution...');
add_paths();
disp('Hello friends');
mcxlab('gpuinfo')
clear cfg cfgs;
%cfg.gpuid='11111111'; % use 8 GPUs together
cfg.gpuid=1;
cfg.autopilot = 1;
cfg.seed = hex2dec('623F9A9E');
cfg.srctype='gaussian';
cfg.unitinmm = 1;
cfg.bc = 'rrrrrr'; % Reflective on all faces except the top (+z) face
cfg.nphoton=1e8;
cfg.tstart=0;
cfg.tend=1e-9; % Total simulation time
cfg.tstep=5e-12; % Time step (5e-12)
cfg.issrcfrom0 = 1; % Positions are from (0,0,0)
cfg.isspecular = 0; %reflection on air-surface boundary (source outside so 0)
cfg.isreflect = 0;
% cfg.workload = [12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5];
% cfg.workload = [10, 10];
disp('base config set, loading data');
cfg.isgpuinfo = 1;

% Read and set volume data
raw = jsondecode(fileread('/u/trialan/Geometric-eigenmodes/templates_jnifti/P03_5tissues.jnii'));
volumeData = raw.struct.NIFTIData.('x_ArrayData_');
volumeData = reshape(volumeData, [192, 256, 256]);
volumeData = volumeData./50; % ensuring indexing is proper (0,1,...)
cfg.vol=volumeData;
cfg.vol=uint8(cfg.vol); % unsigned 8-bit integers for mcxlab
disp('Data has been loaded');

% Set medium properties (NB: ensure indexing is proper)
cfg.prop = [
    0 0 1 1; % medium 0: environment
    0.0195 14.75 0.92 1.37; % medium 1: grey matter
    0.0195 14.75 0.92 1.37; % medium 2: white matter
    0.0021 0.125 0.92 1.37; % medium 3: CSF
    0.0125 11.625 0.92 1.37; % medium 4: skull
    0.0177 9.125 0.92 1.37; % medium 5: skin
];

% Set source positions, source directions, and detectors (NB: flipped x<->z)
cfg.srcpos = [
    96.307083, 68.003639, 55.998356;
    44.174034, 77.827965, 81.520073;
    66.802277, 52.199722, 96.766548;
    26.951130, 114.490875, 108.559990;
    46.643810, 66.358185, 117.232712;
    94.124985, 43.001999, 120.994629;
    28.002001, 95.378166, 145.395218;
    64.492905, 57.509098, 152.482513;
    26.002001, 131.971420, 165.808945;
    58.356976, 69.645020, 180.162170;
    92.075584, 59.001999, 184.685333;
    45.633389, 110.609413, 202.631393;
    69.145714, 74.697334, 200.841049;
    114.983376, 56.565464, 172.580093;
    91.411972, 96.323570, 219.998001;
    67.198456, 136.474472, 222.196457;
];
cfg.srcdir = [
    -0.000000, 0.707107, 0.707107;
    0.999998, 0.002000, -0.000000;
    0.707107, 0.707107, -0.000000;
    0.577350, 0.577350, 0.577350;
    0.707107, 0.707107, -0.000000;
    -0.000000, 0.999998, -0.002000;
    1.000000, -0.000000, -0.000000;
    0.577735, 0.577735, -0.576580;
    0.999998, -0.000000, -0.002000;
    0.577735, 0.577735, -0.576580;
    -0.000000, 0.999998, -0.002000;
    0.707107, -0.000000, -0.707107;
    0.576580, 0.577735, -0.577735;
    -0.577350, 0.577350, -0.577350;
    -0.000000, -0.000000, -1.000000;
    0.707107, -0.000000, -0.707107;
];
det_radius = 3; % set detector radius
cfg.detpos = [
    93.488235, 86.575806, 44.426189, det_radius;
    68.718773, 89.161018, 48.001999, det_radius;
    58.665642, 77.985718, 59.350636, det_radius;
    53.143295, 94.157356, 56.858704, det_radius;
    127.412392, 82.119255, 54.295128, det_radius;
    92.999161, 46.771332, 88.230667, det_radius;
    52.008480, 65.300247, 86.693275, det_radius;
    65.663345, 52.001999, 123.162247, det_radius;
    32.296879, 87.481064, 112.224052, det_radius;
    47.998852, 67.003151, 150.051773, det_radius;
    92.949844, 49.001999, 152.478653, det_radius;
    124.711327, 51.001999, 121.583382, det_radius;
    23.002001, 124.355942, 137.226669, det_radius;
    31.779688, 100.710503, 174.777695, det_radius;
    65.172417, 65.001999, 180.371765, det_radius;
    124.373878, 62.001999, 180.130127, det_radius;
    41.946758, 132.513260, 196.944763, det_radius;
    62.178677, 77.022934, 198.199615, det_radius;
    98.457642, 72.045975, 202.043976, det_radius;
    144.441986, 90.919327, 201.475327, det_radius;
    60.091969, 133.353287, 217.089966, det_radius;
    62.669739, 104.033295, 214.998001, det_radius;
    128.814774, 102.993713, 216.998001, det_radius;
    97.552841, 132.920670, 229.918671, det_radius;
];

%%% SIMULATING AND MCXPLOTVOL
disp("About to run mcxlab");
[flux, detp, vol, seeds] = mcxlab(cfg);
%mcxplotvol(log10(flux.data));
disp("Simulation has run bby");

% Set up the replay configuration
cfg_replay = cfg;
cfg_replay.seed = seeds.data;
cfg_replay.detphotons = detp.data;
cfg_replay.outputtype = 'jacobian'; % Output absorption Jacobian
cfg_replay.replaydet = 0; % Replay all detected photon
disp("about to run replay");

% Run the replay simulation to get the Jacobian
jacobian = mcxlab(cfg_replay);
disp("replay has run");

and add_paths is:

function add_paths()
    % Function to add necessary toolbox paths
    %toolbox_root = 'C:\Users\Lenovo\AppData\Local\MathWorks\MATLAB\R2024b\';
    toolbox_root = '/u/trialan';
    addpath(fullfile(toolbox_root, 'mcxlab'));
    addpath(fullfile(toolbox_root, 'mcx'));
    addpath(fullfile(toolbox_root, 'jsonlab'));
    addpath(fullfile(toolbox_root, 'zmat'));
    addpath(fullfile(toolbox_root, 'mcx', 'utils'));
end

If I set cfg.tstep=1e-11 instead of cfg.tstep=5e-12, then there is no error and everything works (in this case, peak GPU memory usage is about 10GB, which seems odd).

My expectation is that the memory usage should be ~ 192*256^2*4*2*200 bytes, or around 20GB (following your calculation here: https://groups.google.com/g/mcx-users/c/w_YL7M6G-e8/m/NybKPx3jAAAJ). And indeed tracking memory with nvidia-smi in the background indicates peak usage at around 20GB.
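
As a sanity check on that figure, here is a minimal sketch of the arithmetic for both time steps. The dimensions and time settings come from my script; the 4 bytes per value and the factor of 2 are assumptions taken from the linked calculation, and the variable names are only illustrative.

```matlab
% Rough estimate of the time-resolved output buffer size on the GPU.
dims   = [192 256 256];   % cfg.vol dimensions from the script above
tstart = 0;               % cfg.tstart
tend   = 1e-9;            % cfg.tend
tsteps = [5e-12 1e-11];   % the two cfg.tstep values compared in this post
bytes_per_value = 4;      % assumption: one single-precision float per voxel per gate
num_buffers     = 2;      % assumption: two copies of the output buffer

est_gb = zeros(size(tsteps));
for k = 1:numel(tsteps)
    % number of time gates covered by [tstart, tend]
    ngate = round((tend - tstart) / tsteps(k));
    est_gb(k) = prod(dims) * ngate * bytes_per_value * num_buffers / 1e9;
    fprintf('tstep = %g s: %d time gates, about %.1f GB\n', tsteps(k), ngate, est_gb(k));
end
```

Under these assumptions the estimate is about 20.1 GB for tstep=5e-12 and about 10.1 GB for tstep=1e-11, which is in line with the peaks I see in nvidia-smi.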

Unfortunately I cannot share the data I am running this on as it is medical data I'm not allowed to share, sorry about that.

The full error is this:

Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None

Currently Loaded Modules:
  1) gcc/11.4.0      3) cuda/11.8.0         5) slurm-env/0.1   7) matlab/2024a
  2) openmpi/4.1.6   4) cue-login-env/1.0   6) default-s11

--------------------------------------------------------------------------------
          Segmentation violation detected at 2024-11-18 05:53:07 -0600
--------------------------------------------------------------------------------

Configuration:
  Crash Decoding           : Disabled - No sandbox or build area path
  Crash Mode               : continue (default)
  Default Encoding         : UTF-8
  Deployed                 : false
  GNU C Library            : 2.28 stable
  Graphics Driver          : Uninitialized software
  Graphics card 1          : 0x10de ( 0x10de ) 0x2235 Version 550.90.7.0 (0-0-0)
  Graphics card 2          : 0x10de ( 0x10de ) 0x2235 Version 550.90.7.0 (0-0-0)
  Graphics card 3          : 0x10de ( 0x10de ) 0x2235 Version 550.90.7.0 (0-0-0)
  Graphics card 4          : 0x102b ( 0x102b ) 0x538 Version 0.0.0.0 (0-0-0)
  Graphics card 5          : 0x10de ( 0x10de ) 0x2235 Version 550.90.7.0 (0-0-0)
  Java Version             : Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
  MATLAB Architecture      : glnxa64
  MATLAB Entitlement ID    : 7087517
  MATLAB Root              : /sw/external/matlab/2024a
  MATLAB Version           : 24.1.0.2653294 (R2024a) Update 5
  OpenGL                   : software
  Operating System         : "Red Hat Enterprise Linux release 8.8 (Ootpa)"
  Process ID               : 3009653
  Processor ID             : x86 Family 25 Model 1 Stepping 1, AuthenticAMD
  Session Key              : 51d243fc-6507-4a61-afb3-6c43729cab7a
  Window System            : No active display

Fault Count: 1

Abnormal termination:
Segmentation violation

Current Thread: 'MCR 0 interpret' id 140339235780352

Register State (from fault):
  RAX = 00007f9c853fc020  RBX = fffffffe58000000
  RCX = 00007f9add3fc000  RDX = fffffffbf9ffe060
  RSP = 00007fa3464b7748  RBP = 00007fa3464b8ea0
  RSI = 00007fa13b3fefb0  RDI = 00007f9ee33fdfc0

   R8 = ffffffffffffffe0   R9 = 00007fa3400008d2
  R10 = 00007f9add3fc020  R11 = 00007f9c853fc020
  R12 = 00007f9edd3fd010  R13 = 00007fa3464b9190
  R14 = 00007fa3464b9710  R15 = 0000000000000000

  RIP = 00007fa3abd1111e  EFL = 0000000000010282

   CS = 0033   FS = 0000   GS = 0000

Stack Trace (from fault):
[ 0] 0x00007fa3abd1111e /lib64/libc.so.6+00848158
[ 1] 0x00007fa25a83c5de /u/trialan/mcxlab/mcx.mexa64+00194014 mexFunction+00004578
[ 2] 0x00007fa38f8880df /sw/external/matlab/2024a/bin/glnxa64/libmex.so+00966879
[ 3] 0x00007fa38f888157 /sw/external/matlab/2024a/bin/glnxa64/libmex.so+00966999
[ 4] 0x00007fa38f8881c7 /sw/external/matlab/2024a/bin/glnxa64/libmex.so+00967111
[ 5] 0x00007fa38f88977a /sw/external/matlab/2024a/bin/glnxa64/libmex.so+00972666
[ 6] 0x00007fa38f875234 /sw/external/matlab/2024a/bin/glnxa64/libmex.so+00889396
[ 7] 0x00007fa38ff71d66 /sw/external/matlab/2024a/bin/glnxa64/libmwm_dispatcher.so+01535334 _ZN8Mfh_file20dispatch_file_commonEMS_FviPP11mxArray_tagiS2_EiS2_iS2_+00000166
[ 8] 0x00007fa38ff7334c /sw/external/matlab/2024a/bin/glnxa64/libmwm_dispatcher.so+01540940
[ 9] 0x00007fa38ff736ee /sw/external/matlab/2024a/bin/glnxa64/libmwm_dispatcher.so+01541870 _ZN8Mfh_file8dispatchEiPSt10unique_ptrI11mxArray_tagN6matrix6detail17mxDestroy_deleterEEiPPS1_+00000030
[ 10] 0x00007fa38f389d02 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02555138
[ 11] 0x00007fa38f38b264 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02560612
[ 12] 0x00007fa384c4b72f /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+11548463
[ 13] 0x00007fa384c5685f /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+11593823
[ 14] 0x00007fa384bcbf02 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+11026178
[ 15] 0x00007fa3848dc000 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07946240
[ 16] 0x00007fa3848de31c /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07955228
[ 17] 0x00007fa3848db8bb /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07944379
[ 18] 0x00007fa3848ecbbf /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+08014783
[ 19] 0x00007fa3848ed619 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+08017433
[ 20] 0x00007fa3848db6c4 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07943876
[ 21] 0x00007fa3848db7c6 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07944134
[ 22] 0x00007fa384a3677b /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+09365371
[ 23] 0x00007fa384a3a856 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+09381974
[ 24] 0x00007fa38f50fe24 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+04152868
[ 25] 0x00007fa38f377b91 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02481041
[ 26] 0x00007fa38f37add5 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02493909
[ 27] 0x00007fa38ff71d66 /sw/external/matlab/2024a/bin/glnxa64/libmwm_dispatcher.so+01535334 _ZN8Mfh_file20dispatch_file_commonEMS_FviPP11mxArray_tagiS2_EiS2_iS2_+00000166
[ 28] 0x00007fa38ff7334c /sw/external/matlab/2024a/bin/glnxa64/libmwm_dispatcher.so+01540940
[ 29] 0x00007fa38ff736ee /sw/external/matlab/2024a/bin/glnxa64/libmwm_dispatcher.so+01541870 _ZN8Mfh_file8dispatchEiPSt10unique_ptrI11mxArray_tagN6matrix6detail17mxDestroy_deleterEEiPPS1_+00000030
[ 30] 0x00007fa38f389d02 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02555138
[ 31] 0x00007fa38f38b264 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02560612
[ 32] 0x00007fa384c4b72f /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+11548463
[ 33] 0x00007fa384c3df6d /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+11493229
[ 34] 0x00007fa384bcc0e2 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+11026658
[ 35] 0x00007fa3848dc000 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07946240
[ 36] 0x00007fa3848de31c /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07955228
[ 37] 0x00007fa3848db8bb /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07944379
[ 38] 0x00007fa3848ecbbf /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+08014783
[ 39] 0x00007fa3848ed619 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+08017433
[ 40] 0x00007fa3848db6c4 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07943876
[ 41] 0x00007fa3848db7c6 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+07944134
[ 42] 0x00007fa384a3677b /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+09365371
[ 43] 0x00007fa384a3a856 /sw/external/matlab/2024a/bin/glnxa64/libmwm_lxe.so+09381974
[ 44] 0x00007fa38f50fe24 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+04152868
[ 45] 0x00007fa38f3eaebf /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02952895
[ 46] 0x00007fa38f3f2577 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+02983287
[ 47] 0x00007fa38f4b31c5 /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+03772869
[ 48] 0x00007fa38f4b340e /sw/external/matlab/2024a/bin/glnxa64/libmwlxemainservices.so+03773454
[ 49] 0x00007fa38fd3329f /sw/external/matlab/2024a/bin/glnxa64/libmwm_interpreter.so+01434271 _Z51inEvalCmdWithLocalReturnInDesiredWSAndPublishEventsRKNSt7__cxx1112basic_stringIDsSt11char_traitsIDsESaIDsEEEPibbP15inWorkSpace_tagN9MathWorks3lxe10EvalSourceE+00000063
[ 50] 0x00007fa3943b313e /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00971070 _ZNK3iqm18InternalEvalPlugin24inEvalCmdWithLocalReturnERKNSt7__cxx1112basic_stringIDsSt11char_traitsIDsESaIDsEEEP15inWorkSpace_tag+00000110
[ 51] 0x00007fa3943b4124 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00975140 _ZN3iqm18InternalEvalPlugin7executeEP15inWorkSpace_tag+00000420
[ 52] 0x00007fa39439c54f /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00877903
[ 53] 0x00007fa394367a56 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00662102
[ 54] 0x00007fa394367e82 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00663170
[ 55] 0x00007fa3a996f0ff /sw/external/matlab/2024a/bin/glnxa64/libmwmlutil.so+09703679 _ZNK14cmddistributor16IIPRunNowMessage7deliverERKN10foundation7msg_svc8exchange7RoutingE+00000031
[ 56] 0x00007fa3aa4f9e7a /sw/external/matlab/2024a/bin/glnxa64/libmwms.so+03362426 _ZN10foundation7msg_svc8exchange12MessageQueue7deliverERKN7mwboost10shared_ptrIKNS1_8EnvelopeEEE+00000250
[ 57] 0x00007fa3aa4fb190 /sw/external/matlab/2024a/bin/glnxa64/libmwms.so+03367312
[ 58] 0x00007fa3aa4e32dd /sw/external/matlab/2024a/bin/glnxa64/libmwms.so+03269341
[ 59] 0x00007fa3aa4e6f8c /sw/external/matlab/2024a/bin/glnxa64/libmwms.so+03284876
[ 60] 0x00007fa3aa4e2d47 /sw/external/matlab/2024a/bin/glnxa64/libmwms.so+03267911
[ 61] 0x00007fa3a98a77f1 /sw/external/matlab/2024a/bin/glnxa64/libmwmlutil.so+08886257
[ 62] 0x00007fa3a98afb55 /sw/external/matlab/2024a/bin/glnxa64/libmwmlutil.so+08919893
[ 63] 0x00007fa3acfaa56e /sw/external/matlab/2024a/bin/glnxa64/libmwrcf_framework.so+00304494 _ZN7mwboost6detail17shared_state_base13wait_internalERNS_11unique_lockINS_5mutexEEEb+00000222
[ 64] 0x00007fa39421fa62 /sw/external/matlab/2024a/bin/glnxa64/libmwmcr.so+00719458 _ZN7mwboost6futureIvE3getEv+00000098
[ 65] 0x00007fa39420c360 /sw/external/matlab/2024a/bin/glnxa64/libmwmcr.so+00639840
[ 66] 0x00007fa3accf3634 /sw/external/matlab/2024a/bin/glnxa64/libmwmvm.so+03384884 _ZN14cmddistributor15PackagedTaskIIP10invokeFuncIN7mwboost8functionIFvvEEEEENS2_10shared_ptrINS2_6futureIDTclfp_EEEEEERKT_+00000068
[ 67] 0x00007fa3accf38e9 /sw/external/matlab/2024a/bin/glnxa64/libmwmvm.so+03385577 _ZNSt17_Function_handlerIFN7mwboost3anyEvEZN14cmddistributor15PackagedTaskIIP10createFuncINS0_8functionIFvvEEEEESt8functionIS2_ET_EUlvE_E9_M_invokeERKSt9_Any_data+00000025
[ 68] 0x00007fa3943bd8cd /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+01013965 _ZN3iqm18PackagedTaskPlugin7executeEP15inWorkSpace_tag+00000093
[ 69] 0x00007fa39439c54f /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00877903
[ 70] 0x00007fa3943669b8 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00657848
[ 71] 0x00007fa38f95dab9 /sw/external/matlab/2024a/bin/glnxa64/libmwbridge.so+00498361
[ 72] 0x00007fa38f95df43 /sw/external/matlab/2024a/bin/glnxa64/libmwbridge.so+00499523
[ 73] 0x00007fa38f979592 /sw/external/matlab/2024a/bin/glnxa64/libmwbridge.so+00611730 _Z22mnGetCommandLineBufferbRbN7mwboost8optionalIKP15inWorkSpace_tagEEbRKNS0_9function2IN6mlutil14cmddistributor17inExecutionStatusERKNSt7__cxx1112basic_stringIDsSt11char_traitsIDsESaIDsEEES4_EE+00000210
[ 74] 0x00007fa38f9798f9 /sw/external/matlab/2024a/bin/glnxa64/libmwbridge.so+00612601 _Z8mnParserv+00000521
[ 75] 0x00007fa394242c9f /sw/external/matlab/2024a/bin/glnxa64/libmwmcr.so+00863391
[ 76] 0x00007fa3accf3634 /sw/external/matlab/2024a/bin/glnxa64/libmwmvm.so+03384884 _ZN14cmddistributor15PackagedTaskIIP10invokeFuncIN7mwboost8functionIFvvEEEEENS2_10shared_ptrINS2_6futureIDTclfp_EEEEEERKT_+00000068
[ 77] 0x00007fa3accf38e9 /sw/external/matlab/2024a/bin/glnxa64/libmwmvm.so+03385577 _ZNSt17_Function_handlerIFN7mwboost3anyEvEZN14cmddistributor15PackagedTaskIIP10createFuncINS0_8functionIFvvEEEEESt8functionIS2_ET_EUlvE_E9_M_invokeERKSt9_Any_data+00000025
[ 78] 0x00007fa3943bd8cd /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+01013965 _ZN3iqm18PackagedTaskPlugin7executeEP15inWorkSpace_tag+00000093
[ 79] 0x00007fa39439c54f /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00877903
[ 80] 0x00007fa394365252 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00651858
[ 81] 0x00007fa394365ba3 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00654243
[ 82] 0x00007fa394365ea4 /sw/external/matlab/2024a/bin/glnxa64/libmwiqm.so+00655012
[ 83] 0x00007fa39422f17e /sw/external/matlab/2024a/bin/glnxa64/libmwmcr.so+00782718
[ 84] 0x00007fa39422ed75 /sw/external/matlab/2024a/bin/glnxa64/libmwmcr.so+00781685
[ 85] 0x00007fa39422efcd /sw/external/matlab/2024a/bin/glnxa64/libmwmcr.so+00782285
[ 86] 0x00007fa3ab6fc277 /sw/external/matlab/2024a/bin/glnxa64/libmwboost_thread.so.1.78.0+00045687
[ 87] 0x00007fa3ac2011ca /lib64/libpthread.so.0+00033226
[ 88] 0x00007fa3abc7be73 /lib64/libc.so.6+00237171 clone+00000067

This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.

** This crash report has been saved to disk as /u/trialan/matlab_crash_dump.3009653-1 **

MATLAB is exiting because of fatal error
/var/spool/slurmd/job5572661/slurm_script: line 34: 3009653 Killed matlab -batch "addpath('/u/trialan/Geometric-eigenmodes/forward_fNIRS'); plo


The output log is:

Job started on node: gpub038 at Mon Nov 18 05:52:21 CST 2024
Starting MATLAB script execution...
Hello friends
============================= GPU Information ================================
Device 1 of 1: NVIDIA A40
Compute Capability: 8.6
Global Memory: 47608692736 B
Constant Memory: 65536 B
Shared Memory: 49152 B
Registers: 65536
Clock Speed: 1.74 GHz
Number of SMs: 84
Number of Cores: 5376
Auto-thread: 344064
Auto-block: 64

ans =

  struct with fields:

          name: 'NVIDIA A40'
            id: 1
      devcount: 1
         major: 8
         minor: 6
     globalmem: 4.7609e+10
      constmem: 65536
     sharedmem: 49152
      regcount: 65536
         clock: 1740000
            sm: 84
          core: 5376
     autoblock: 64
    autothread: 344064
       maxgate: 0

base config set, loading data
Data has been loaded
About to run mcxlab
Launching MCXLAB - Monte Carlo eXtreme for MATLAB & GNU Octave ...
Running simulations for configuration #1 ...
mcx.gpuid=1;
mcx.autopilot=1;
mcx.seed=1648335518;
mcx.srctype='gaussian';
mcx.unitinmm=1;
mcx.bc='rrrrrr';
mcx.nphoton=1e+08;
mcx.tstart=0;
mcx.tend=1e-09;
mcx.tstep=5e-12;
mcx.issrcfrom0=1;
mcx.isspecular=0;
mcx.isreflect=0;
mcx.isgpuinfo=1;
mcx.dim=[192 256 256];
mcx.mediabyte=1;
mcx.medianum=6;
mcx.srcpos=[96.3071 68.0036 55.9984 1];
mcx.extrasrclen=15;
mcx.srcdir=[-0 0.707107 0.707107 0];
mcx.extrasrclen=15;
mcx.detnum=24;
============================= GPU Information ================================
Device 1 of 1: NVIDIA A40
Compute Capability: 8.6
Global Memory: 47608692736 B
Constant Memory: 65536 B
Shared Memory: 49152 B
Registers: 65536
Clock Speed: 1.74 GHz
Number of SMs: 84
Number of Cores: 5376
Auto-thread: 344064
Auto-block: 64
###############################################################################
#                      Monte Carlo eXtreme (MCX) -- CUDA                      #
#          Copyright (c) 2009-2024 Qianqian Fang <q.fang at neu.edu>          #
#                 https://mcx.space/ & https://neurojson.io/                  #
#                                                                             #
# Computational Optics & Translational Imaging (COTI) Lab- http://fanglab.org #
#   Department of Bioengineering, Northeastern University, Boston, MA, USA    #
###############################################################################
#    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365      #
###############################################################################
#  Open-source codes and reusable scientific data are essential for research, #
# MCX proudly developed human-readable JSON-based data formats for easy reuse.#
#                                                                             #
#Please visit our free scientific data sharing portal at https://neurojson.io/#
# and consider sharing your public datasets in standardized JSON/JData format #
###############################################################################
$Rev::188338$v2024.6 $Date::2024-11-13 00:00:36 -05$ by $Author::Qianqian Fang$
###############################################################################
- code name: [Jumbo Jolt] compiled by nvcc [7.5] for CUDA-arch [350] on [Nov 18 2024]
- compiled with: RNG [xorshift128+] with Seed Length [4]

GPU=1 (NVIDIA A40) threadph=290 extra=221440 np=100000000 nthread=344064 maxgate=200 repetition=1
initializing streams ... init complete : 20 ms
requesting 3584 bytes of shared memory
launching MCX simulation for time window [0.00e+00ns 1.00e+00ns] ...
simulation run# 1 ...
kernel complete: 1855 ms
retrieving fields ... detected 229928 photons, total: 229928 transfer complete: 11019 ms
normalizing raw data ... source 1, normalization factor alpha=2000.000000
data normalization complete : 11310 ms
simulated 100000000 photons (100000000) with 344064 threads (repeat x1)
MCX simulation speed: 57971.01 photon/ms
total simulated energy: 100000000.00 absorbed: 23.80213%

If it helps, I have access to large numbers of GPUs through the cluster I use, but I don't get the impression this is the root cause.

Again, great package, and thanks for a super helpful google group!

Regards,
Thomas Rialan

Fang, Qianqian

Nov 18, 2024, 11:22:46 PM
to mcx-users
Hi Thomas, 

thanks for reporting this.

I took a look at the issue you reported and was able to reproduce the crash on an A100-40GB GPU. Yes, the GPU memory was only 50% utilized (Nx*Ny*Nz*Nt*sizeof(float) = 10GB, and *2 = 20GB due to the use of a double buffer to reduce the error shown in https://github.com/fangq/mcx/issues/41). The crash had two causes:

  1. The output data buffer length variable in various units is an "int" type, which cannot hold a value larger than 2^31-1.
  2. Our previously released mcxlab .mex file for Linux was compiled using an old version of MATLAB, which uses a 32-bit int for array dimensions (mwSize); this was changed to uint64 at some point.
Because of both issues, the data buffer prepared to receive the GPU output did not have the correct length.
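
To illustrate the first point, here is a minimal sketch that compares the number of output elements with the largest value a signed 32-bit integer can hold. It assumes the buffer length in question is counted as Nx*Ny*Nz*Nt elements; the variable names are only illustrative and are not taken from the MCX source.

```matlab
% Compare the time-resolved output element count against the int32 limit.
dims      = [192 256 256];             % mcx.dim from the log above
ngates    = [100 200];                 % tstep = 1e-11 and 5e-12 with tend = 1e-9
int32_max = double(intmax('int32'));   % 2^31 - 1 = 2147483647

nelem = prod(dims) .* ngates;          % assumed buffer length, in elements
fits  = nelem <= int32_max;            % true if a 32-bit int can hold the length
for k = 1:numel(ngates)
    fprintf('%d gates: %.0f elements, fits in int32: %d\n', ...
            ngates(k), nelem(k), double(fits(k)));
end
```

With 100 gates the count is 1258291200, which is below 2^31-1; with 200 gates it is 2516582400, which is above it. That is consistent with the run at tstep=1e-11 succeeding and the run at tstep=5e-12 crashing even though the GPU memory was not full.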

I created an issue at https://github.com/fangq/mcx/issues/235 and was able to commit a fix. Please check out the GitHub Actions CI build and see if the problem has been resolved.

On a related note - I have to emphasize while it is feasible to create many time gates or using high resolution voxelated domains, you have to be aware that MC is ultimately a game of stochastic noise; increasing output data buffer voxel size while keeping the total simulated photon count the same will just result in noisy data - it is just a trade-off.

We have prepared a general best-practices guide in the document below. Of course, your use case may be different.



Qianqian




Thomas Rialan

Nov 19, 2024, 4:17:19 AM
to mcx-users

Thank you very much Dr. Fang, your fix has worked perfectly. However, it seems the best practices guide isn't open access; when I try to open it I get this error:

Notebook not found
There was an error loading this notebook. Ensure that the file is accessible and try again.
https://github.com/COTILab/MCX24Workshop/blob/master/Training/MCX2024_1C_mcx_command_line.ipynb
Fetch for https://api.github.com/repos/COTILab/MCX24Workshop/contents/Training?per_page=100&ref=master failed: { "message": "Not Found", "documentation_url": "https://docs.github.com/rest/repos/contents#get-repository-content", "status": "404" } CustomError: Fetch for https://api.github.com/repos/COTILab/MCX24Workshop/contents/Training?per_page=100&ref=master failed: { "message": "Not Found", "documentation_url": "https

My collaborator was also not able to open it so I think it's not specific to my account.

"On a related note - I have to emphasize while it is feasible to create many time gates or using high resolution voxelated domains, you have to be aware that MC is ultimately a game of stochastic noise; increasing output data buffer voxel size while keeping the total simulated photon count the same will just result in noisy data - it is just a trade-off."

Do you mean by this that I should consider a much larger photon count? This is also something I've thought about, I haven't yet done the calculations for establishing an acceptable SNR. 

Many thanks,
Thomas Rialan


Fang, Qianqian

Nov 19, 2024, 12:08:52 PM
to mcx-users

> Do you mean by this that I should consider a much larger photon count? This is also something I've thought about, I haven't yet done the calculations for establishing an acceptable SNR. 

In the end, it is the number of photons that determines the overall MC output SNR. Splitting the results into many time gates or many voxels won't change the overall SNR; it only trades higher spatial/temporal resolution for noisier per-voxel data.
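
As a rough illustration of this trade-off, the sketch below uses a deliberately simplified model: it assumes the photons contributing to one voxel are spread evenly over the time gates and that the relative noise of each bin scales as 1/sqrt(count). Neither assumption is exact for a real MCX run, and the variable names are illustrative only.

```matlab
% Toy model: how per-gate noise grows when a fixed photon budget is split.
nphoton = 1e8;              % cfg.nphoton from the original script
ngates  = [1 100 200];      % number of time gates

% Assumption: photons are shared evenly across gates.
count_per_gate = nphoton ./ ngates;

% Assumption: relative noise of a bin ~ 1/sqrt(number of contributing photons).
rel_noise   = 1 ./ sqrt(count_per_gate);
noise_ratio = rel_noise ./ rel_noise(1);   % growth relative to a single gate

% Photon count that would bring each gate back to the single-gate noise level.
photons_needed = nphoton .* ngates;

for k = 1:numel(ngates)
    fprintf('%d gates: per-gate noise x%.1f, photons to compensate: %.1e\n', ...
            ngates(k), noise_ratio(k), photons_needed(k));
end
```

In this model, going from 100 to 200 gates raises the per-gate noise by a factor of sqrt(2), and keeping the per-gate noise unchanged requires doubling the photon count.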

Hope this makes sense.

If you want to see other ways to speed up mcx simulations, please check out my replies in another thread



Qianqian

