Replay issue in demo_example_replay.m

35 views
Skip to first unread message

Xiuxiu Zhang

unread,
Feb 2, 2026, 10:04:18 PMFeb 2
to mmc-users
Dear Dr. Fang, 

I ran demo_example_replay.m on my laptop with an Apple M4 Pro chip. The initial Monte Carlo simulation runs without issue and detects ~3000 photons. However, during the replay step (newcfg), only ~4 photons are detected and the terminal reports that the replay is not successful. I did not modify the sample code. The script still generates the Jacobian figure (attached).

A coworker was able to run the same demo successfully on an NVIDIA GPU, so I was wondering whether this behavior could be related to the Apple Silicon OpenCL backend. Please let me know if there are recommended settings or known limitations for running replay on macOS.

Thank you,
Xiuxiu ZhangScreenshot 2026-02-02 at 9.51.10 PM.pngScreenshot 2026-02-02 at 9.41.23 PM.png

Qianqian Fang

unread,
Feb 3, 2026, 10:45:32 PMFeb 3
to mmc-...@googlegroups.com, Xiuxiu Zhang

hi Xiuxiu,

yes, this behavior is a combination of an recently enabled OpenCL optimization in MMC kernel, and differences in OpenCL thread scheduling (i.e. the orders of the execution of many threads) on different devices.

In the most recent release, v2025.10, I increased the kernel optimization level (cfg.optlevel=3, the optlevels are explained in our MCXCL paper [1] - each level enables extra optimization strategies to improve simulation speed on OpenCL, previously, the default optlevel is 1).

after this optimization, I noticed that for devices that the execution order is more stable/repeatable, such as NVIDIA GPU, the replay is still quite reliable, but for some devices, the replay seems to be sensitive to these additional optimization strategies (especially the block-based workload balance).


there is a simple workaround to this issue, you just need to set cfg.optlevel=1 to reverse to the previous setting.


below is the session I ran on an M2 Pro processor. Initially, using the new default optlevel=3, the replay failed, just like you reported, but after setting newcfg.optlevel=1, the replay is successful.


hope this explanation helps.

Qianqian


fangq@yugong:~/Downloads/mmclab/example$ sw_vers 
ProductName:        macOS
ProductVersion:        14.5
BuildVersion:        23F79
fangq@yugong:~/Downloads/mmclab/example$ sysctl -a machdep.cpu
machdep.cpu.cores_per_package: 10
machdep.cpu.core_count: 10
machdep.cpu.logical_per_package: 10
machdep.cpu.thread_count: 10
machdep.cpu.brand_string: Apple M2 Pro
fangq@yugong:~/Downloads/mmclab/example$ 
fangq@yugong:~/Downloads/mmclab/example$ matlab -nojvm


                                                     < M A T L A B (R) >
                                           Copyright 1984-2024 The MathWorks, Inc.
                                       R2024a Update 3 (24.1.0.2603908) 64-bit (maca64)
                                                         May 2, 2024

 
For online documentation, see https://www.mathworks.com/support
For product information, visit www.mathworks.com.
 
>> 
>> cd ..
>> addpath(pwd)
>> cd example/
>> demo_example_replay             
Launching MMCLAB - Mesh-based Monte Carlo for MATLAB & GNU Octave ...
Running simulations for configuration #1 ...
mmc.nphoton=1e+06;
mmc.seed=1648335518;
mmc.nn=29791;
mmc.elem=[162000,4];
mmc.ne=162000;
mmc.srcpos=[30.1 30.2 0];
mmc.srcdir=[0 0 1 0];
mmc.tstart=0;
mmc.tend=5e-09;
mmc.tstep=5e-09;
mmc.prop=1;
mmc.debuglevel='TP';
mmc.method='elem';
mmc.isreflect=0;
mmc.detnum=4;
mmc.issaveexit=1;
mmc.issaveseed=1;
mmc.facenb=[162000,4];
mmc.evol=162000;
mmc.e0=27466;
    done    7
simulating ... 
###############################################################################
#                     Mesh-based Monte Carlo (MMC) - OpenCL                   #
#          Copyright (c) 2010-2025 Qianqian Fang <q.fang at neu.edu>          #
#              https://mcx.space/#mmc  &  https://neurojson.io                #
#                                                                             #
#Computational Optics & Translational Imaging (COTI) Lab  [http://fanglab.org]#
#   Department of Bioengineering, Northeastern University, Boston, MA, USA    #
###############################################################################
#    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365      #
###############################################################################
#  Open-source codes and reusable scientific data are essential for research, #
# MCX proudly developed human-readable JSON-based data formats for easy reuse.#
#                                                                             #
#Please visit our free scientific data sharing portal at https://neurojson.io #
# and consider sharing your public datasets in standardized JSON/JData format #
###############################################################################
$Rev::0684c2$v2025.10$Date::2025-10-20 11:05:07 -04$ by $Author::Qianqian Fang$
###############################################################################
- variant name: [MMC-OpenCL] compiled with OpenCL version [1]
- compiled with: [RNG] xorshift128+ RNG [Seed Length] 4
initializing streams ...    init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DUSE_MACRO_CONST -DMCX_SRC_PENCIL   -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DMCX_SAVE_SEED -DUSE_BLBADOUEL -Dgcfgdebuglevel=2560  -Dgcfgdstep=1.0000000000e+00f  -Dgcfge0=27466  -Dgcfgelemlen=4  -Dgcfgfocus=0.0000000000e+00f  -Dgcfgframelen=162000  -Dgcfgisextdet=0  -Dgcfgismomentum=0  -Dgcfgisreflect=0  -Dgcfgissavedet=1  -Dgcfgissaveexit=1  -Dgcfgissaveref=0  -Dgcfgisspecular=0  -Dgcfgmaxdetphoton=1000000  -Dgcfgmaxjumpdebug=10000000  -Dgcfgmaxmedia=1  -Dgcfgmaxgate=1  -Dgcfgmaxpropdet=6  -Dgcfgmethod=3  -Dgcfgminenergy=9.9999999748e-07f  -Dgcfgnormbuf=998  -Dgcfgnout=1.0000000000e+00f  -Dgcfgoutputtype=0  -Dgcfgreclen=9  -Dgcfgroulettesize=1.0000000000e+01f  -DgcfgRtstep=2.0000000000e+08f  -Dgcfgsrcelemlen=0  -Dgcfgsrctype=0  -Dgcfgvoidtime=1  -Dgcfgseed=1648335518  -Dgcfgsrcnum=1  -Dgcfgdetnum=4  -Dgcfgissaveseed=1  -Dgcfgnf=10800 
build program complete : 3 ms
- [device 0(1): Apple M2 Pro] threadph=81 oddphotons=4672 np=1000000.0 nthread=12288 nblock=64 repetition=1
requesting 2816 bytes of shared memory
set kernel arguments complete : 3 ms 3
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [==============================================================] 100%
kernel complete:      1436 ms
retrieving flux ...     detected 2999 photons, total: 2999    transfer complete:        1453 ms
source 1    total simulated energy: 1000000.000000    absorbed: 17.66706%    normalizor=209.381
simulated 1000000 photons (1000000) with 1 devices (ray-tet 228338960)
MCX simulation speed: 697.84 photon/ms
total simulated energy: 1000000.00    absorbed: 17.66706%
(loss due to initial specular reflection is excluded in the total)
    done    1494
Launching MMCLAB - Mesh-based Monte Carlo for MATLAB & GNU Octave ...
Running simulations for configuration #1 ...
mmc.nphoton=1e+06;
mmc.nphoton=2999;
mmc.nn=29791;
mmc.elem=[162000,4];
mmc.ne=162000;
mmc.srcpos=[30.1 30.2 0];
mmc.srcdir=[0 0 1 0];
mmc.tstart=0;
mmc.tend=5e-09;
mmc.tstep=5e-09;
mmc.prop=1;
mmc.debuglevel='TP';
mmc.method='elem';
mmc.isreflect=0;
mmc.detnum=4;
mmc.issaveexit=1;
mmc.issaveseed=1;
mmc.facenb=[162000,4];
mmc.evol=162000;
mmc.e0=27466;
mmc.replaydet=1;
mmc.detphotons=[10 2999];
mmc.isnormalized=0;
mmc.outputtype='wl';
    done    13
simulating ... 
###############################################################################
#                     Mesh-based Monte Carlo (MMC) - OpenCL                   #
#          Copyright (c) 2010-2025 Qianqian Fang <q.fang at neu.edu>          #
#              https://mcx.space/#mmc  &  https://neurojson.io                #
#                                                                             #
#Computational Optics & Translational Imaging (COTI) Lab  [http://fanglab.org]#
#   Department of Bioengineering, Northeastern University, Boston, MA, USA    #
###############################################################################
#    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365      #
###############################################################################
#  Open-source codes and reusable scientific data are essential for research, #
# MCX proudly developed human-readable JSON-based data formats for easy reuse.#
#                                                                             #
#Please visit our free scientific data sharing portal at https://neurojson.io #
# and consider sharing your public datasets in standardized JSON/JData format #
###############################################################################
$Rev::0684c2$v2025.10$Date::2025-10-20 11:05:07 -04$ by $Author::Qianqian Fang$
###############################################################################
- variant name: [MMC-OpenCL] compiled with OpenCL version [1]
- compiled with: [RNG] xorshift128+ RNG [Seed Length] 4
initializing streams ...    init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DUSE_MACRO_CONST -DMCX_SRC_PENCIL   -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DUSE_BLBADOUEL -Dgcfgdebuglevel=2560  -Dgcfgdstep=1.0000000000e+00f  -Dgcfge0=27466  -Dgcfgelemlen=4  -Dgcfgfocus=0.0000000000e+00f  -Dgcfgframelen=162000  -Dgcfgisextdet=0  -Dgcfgismomentum=0  -Dgcfgisreflect=0  -Dgcfgissavedet=1  -Dgcfgissaveexit=1  -Dgcfgissaveref=0  -Dgcfgisspecular=0  -Dgcfgmaxdetphoton=1000000  -Dgcfgmaxjumpdebug=10000000  -Dgcfgmaxmedia=1  -Dgcfgmaxgate=1  -Dgcfgmaxpropdet=6  -Dgcfgmethod=3  -Dgcfgminenergy=9.9999999748e-07f  -Dgcfgnormbuf=998  -Dgcfgnout=1.0000000000e+00f  -Dgcfgoutputtype=4  -Dgcfgreclen=9  -Dgcfgroulettesize=1.0000000000e+01f  -DgcfgRtstep=2.0000000000e+08f  -Dgcfgsrcelemlen=0  -Dgcfgsrctype=0  -Dgcfgvoidtime=1  -Dgcfgseed=4294966297  -Dgcfgsrcnum=1  -Dgcfgdetnum=4  -Dgcfgissaveseed=0  -Dgcfgnf=10800 
build program complete : 3 ms
- [device 0(1): Apple M2 Pro] threadph=0 oddphotons=702 np=702.0 nthread=12288 nblock=64 repetition=1
requesting 2816 bytes of shared memory
set kernel arguments complete : 3 ms 3
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [==============================================================] 100%
kernel complete:      234 ms
retrieving flux ...     detected 2 photons, total: 2    transfer complete:        261 ms
simulated 702 photons (702) with 1 devices (ray-tet 175748)
MCX simulation speed: 3.04 photon/ms
total simulated energy: 702.00    absorbed: 19.15085%
(loss due to initial specular reflection is excluded in the total)
    done    294
replay failed :-(
Error using figure
This functionality is not supported under the -nojvm startup option.

Error in demo_example_replay (line 66)
figure;
 
>> 
>> newcfg.optlevel=1;              
>> 
>> [cube3, detp3] = mmclab(newcfg);
Launching MMCLAB - Mesh-based Monte Carlo for MATLAB & GNU Octave ...
Running simulations for configuration #1 ...
mmc.nphoton=1e+06;
mmc.nphoton=2999;
mmc.nn=29791;
mmc.elem=[162000,4];
mmc.ne=162000;
mmc.srcpos=[30.1 30.2 0];
mmc.srcdir=[0 0 1 0];
mmc.tstart=0;
mmc.tend=5e-09;
mmc.tstep=5e-09;
mmc.prop=1;
mmc.debuglevel='TP';
mmc.method='elem';
mmc.isreflect=0;
mmc.detnum=4;
mmc.issaveexit=1;
mmc.issaveseed=1;
mmc.facenb=[162000,4];
mmc.evol=162000;
mmc.e0=27466;
mmc.replaydet=1;
mmc.detphotons=[10 2999];
mmc.isnormalized=0;
mmc.outputtype='wl';
mmc.optlevel=1;
    done    17
simulating ... 
###############################################################################
#                     Mesh-based Monte Carlo (MMC) - OpenCL                   #
#          Copyright (c) 2010-2025 Qianqian Fang <q.fang at neu.edu>          #
#              https://mcx.space/#mmc  &  https://neurojson.io                #
#                                                                             #
#Computational Optics & Translational Imaging (COTI) Lab  [http://fanglab.org]#
#   Department of Bioengineering, Northeastern University, Boston, MA, USA    #
###############################################################################
#    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365      #
###############################################################################
#  Open-source codes and reusable scientific data are essential for research, #
# MCX proudly developed human-readable JSON-based data formats for easy reuse.#
#                                                                             #
#Please visit our free scientific data sharing portal at https://neurojson.io #
# and consider sharing your public datasets in standardized JSON/JData format #
###############################################################################
$Rev::0684c2$v2025.10$Date::2025-10-20 11:05:07 -04$ by $Author::Qianqian Fang$
###############################################################################
- variant name: [MMC-OpenCL] compiled with OpenCL version [1]
- compiled with: [RNG] xorshift128+ RNG [Seed Length] 4
initializing streams ...    init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL   -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DUSE_BLBADOUEL
build program complete : 4 ms
- [device 0(1): Apple M2 Pro] threadph=0 oddphotons=702 np=702.0 nthread=12288 nblock=64 repetition=1
requesting 2816 bytes of shared memory
set kernel arguments complete : 4 ms 3
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... 
Progress: [==============================================================] 100%
kernel complete:      236 ms
retrieving flux ...     detected 702 photons, total: 702    transfer complete:        263 ms
simulated 702 photons (702) with 1 devices (ray-tet 243296)
MCX simulation speed: 3.03 photon/ms
total simulated energy: 702.00    absorbed: 35.66643%
(loss due to initial specular reflection is excluded in the total)
    done    298

On 2/2/26 22:04, Xiuxiu Zhang wrote:
Dear Dr. Fang, 

I ran demo_example_replay.m on my laptop with an Apple M4 Pro chip. The initial Monte Carlo simulation runs without issue and detects ~3000 photons. However, during the replay step (newcfg), only ~4 photons are detected and the terminal reports that the replay is not successful. I did not modify the sample code. The script still generates the Jacobian figure (attached).

A coworker was able to run the same demo successfully on an NVIDIA GPU, so I was wondering whether this behavior could be related to the Apple Silicon OpenCL backend. Please let me know if there are recommended settings or known limitations for running replay on macOS.

Thank you,
Xiuxiu Zhang

--
You received this message because you are subscribed to the Google Groups "mmc-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mmc-users+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/mmc-users/dc1625c5-874f-420d-b459-83e17a960b45n%40googlegroups.com.

Qianqian Fang

unread,
Feb 3, 2026, 10:52:12 PMFeb 3
to mmc-...@googlegroups.com, Xiuxiu Zhang

sorry, forgot to provide the MCXCL paper link

[1] Leiming Yu, Fanny Nina-Paravecino, David Kaeli, Qianqian Fang*, "Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms," J. Biomed. Opt. 23(1), 010504 (2018).

https://doi.org/10.1117/1.JBO.23.1.010504


a quick way to tell whether you are using optlevel=3 or 1 is to see the printed log, particularly the line starting with "Building kernel with option", if you see a long line with many macros, you are using optlevel=3


Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DUSE_MACRO_CONST -DMCX_SRC_PENCIL   -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DMCX_SAVE_SEED -DUSE_BLBADOUEL -Dgcfgdebuglevel=2560  -Dgcfgdstep=1.0000000000e+00f  -Dgcfge0=27466  -Dgcfgelemlen=4  -Dgcfgfocus=0.0000000000e+00f  -Dgcfgframelen=162000  -Dgcfgisextdet=0  -Dgcfgismomentum=0  -Dgcfgisreflect=0  -Dgcfgissavedet=1  -Dgcfgissaveexit=1  -Dgcfgissaveref=0  -Dgcfgisspecular=0  -Dgcfgmaxdetphoton=1000000  -Dgcfgmaxjumpdebug=10000000  -Dgcfgmaxmedia=1  -Dgcfgmaxgate=1  -Dgcfgmaxpropdet=6  -Dgcfgmethod=3  -Dgcfgminenergy=9.9999999748e-07f  -Dgcfgnormbuf=998  -Dgcfgnout=1.0000000000e+00f  -Dgcfgoutputtype=0  -Dgcfgreclen=9  -Dgcfgroulettesize=1.0000000000e+01f  -DgcfgRtstep=2.0000000000e+08f  -Dgcfgsrcelemlen=0  -Dgcfgsrctype=0  -Dgcfgvoidtime=1  -Dgcfgseed=1648335518  -Dgcfgsrcnum=1  -Dgcfgdetnum=4  -Dgcfgissaveseed=1  -Dgcfgnf=10800 


if you use optlevel=1, you will see only a much shorter list of options

Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL   -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DUSE_BLBADOUEL


Xiuxiu Zhang

unread,
Feb 4, 2026, 1:57:22 PMFeb 4
to mmc-users
Thanks so much for the guidance, Dr. Fang. The replay works on my end now. 

One other thing I noticed in the demo replay code: https://github.com/fangq/mmc/blob/master/mmclab/example/demo_example_replay.m--------
Before the reply success check in line 59, you might want to also add:  detp.data = detp.data(:, detp.data(1, :) == newcfg.replaydet) to make sure its only check detp.data for detector #1. You had this line for wp replay but not for wl replay. Let me know if I am understanding the code incorrectly.

Best,
Xiuxiu

Xiuxiu Zhang

unread,
Apr 7, 2026, 11:19:03 PM (9 days ago) Apr 7
to mmc-users
Hi Dr. Fang,

I wanted to follow up on replay mode and ask about the difference between newcfg.outputtype = 'wl' and newcfg.outputtype = 'jacobian'. My understanding is that 'wl' returns the weighted pathlength used to construct the mua Jacobian, while 'jacobian' directly outputs the mua Jacobian. However, when I run MMCLAB with these two settings and assign cube2 and detp2 as the first two outputs, I obtain identical results for both cases. I was wondering whether these two output types are expected to produce different results, or if I might be overlooking where the difference appears.

Additionally, for calculating the differential pathlength factor (DPF) from my Monte Carlo simulations, is there a way to incorporate the weighted pathlength? If so, could you clarify where this quantity is stored in the outputs?

Thanks,
Xiuxiu
On Tuesday, February 3, 2026 at 10:52:12 PM UTC-5 q.fang wrote:

Xiuxiu Zhang

unread,
Apr 8, 2026, 4:46:19 PM (9 days ago) Apr 8
to mmc-users
Hi Dr. Fang, 

Just following up on my previous email. I also have the trajectory data, but it appears that it records every photon (up to ~1e7 by default). I was wondering if there is a way to retrieve only the detected photon weights instead.

Additionally, for rows 2–4 in the trajectory output, could you clarify what the trajectory positions represent? Do these correspond to the photon locations prior to complete absorption?

Thanks,
Xiuxiu

Qianqian Fang

unread,
Apr 10, 2026, 10:22:09 AM (7 days ago) Apr 10
to mmc-...@googlegroups.com, Xiuxiu Zhang

hi Xiuxiu,

historically, jacobian and wl outputs are implemented separately in the CPU-only code (mmc_raytrace.c) with a subtle difference:

when outputtype='jacobian', mmc accumulates w*exp(-DELTA_MUA*L), where w is the detected weight of the replayed photon, and then divide the constant DELTA_MUA=1e-4 at the normalization stage; as you can see, this is a discrete approximation to the derivative (w.r.t mua), i.e

output = sum(w*exp(-delta_mua*L))/delta_mua


when the outputtype is 'wl', mmc directly accumulates -w*L

output = sum(-w*L)


see source code here

https://github.com/fangq/mmc/blob/v2025.10/src/mmc_raytrace.c#L356-L364


when delta_mua is small (which is the case), you can see that the Taylor's expansion of the 'jacobian' output form, exp(...), gives -w*delta_mua*L, divided by constant delta_mua at the end resulting in sum(-w*L), which is the same as 'wl'.


you can see some small difference when you use CPU/SSE backend (-G -1, or cfg.gpuid=-1), but they are numerically equivalent.


because of that, in the GPU version, for either output types, we only accumulates -w*L, this makes 'jacobian' an alias to `wl`, as you have observed.


> Additionally, for calculating the differential pathlength factor (DPF) from my Monte Carlo simulations, is there a way to incorporate the weighted pathlength? If so, could you clarify where this quantity is stored in the outputs?

the output type 'length' was added to mcx since v2023 in this commit

https://github.com/fangq/mcx/commit/e67feed6af164302f34308b91f941d95e932d952

but it has not been ported to mmc. the fluence output should be effectively close to w*L output if you apply the same Taylor's expansion of the exp() above to the fluence output.


Qianqian

Reply all
Reply to author
Forward
0 new messages