my previous comment that "exact reproducibility is not guaranteed" refers to the output fluence - because the accumulation of floating-point numbers from multiple threads may produce different outputs depending on the execution order (i.e., limited-precision floating-point summation is not associative). However, that comment does not apply to RNG seeds. The host/CPU-side random number seeds are expected to be reproducible; as a result, the RNG seed used for each photon inside each thread is also expected to be exactly reproducible.
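To illustrate why (a minimal sketch with assumed names, not MCX's actual seeding code): if the host derives every thread's seed serially from a single user-supplied seed, the seed table is identical on every rerun, no matter how the GPU later schedules the threads.

/* sketch: deterministic host-side seed generation (illustrative only) */
#include <stdio.h>
#include <stdlib.h>

#define NTHREAD 8

int main(void) {
    unsigned int hostseed = 1234;        /* user-supplied seed */
    unsigned int seeds[NTHREAD];

    srand(hostseed);                     /* serial and deterministic */
    for (int i = 0; i < NTHREAD; i++)
        seeds[i] = (unsigned int)rand(); /* thread i always gets seeds[i] */

    for (int i = 0; i < NTHREAD; i++)
        printf("thread %d seed: %u\n", i, seeds[i]);
    return 0;
}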
it should not be called an "error" - it is the expected (and frequently observed) behavior of floating-point operations in a multi-threading environment.
the source of the problem is that the MC output is obtained by accumulating small floating-point numbers at each voxel as every photon packet passes by. Therefore, the output value can ultimately be written as

v = (((v_t1 + v_t2) + v_t3) + v_t4) + ...

where v_ti denotes the contribution of the packet simulated by the i-th thread. However, in a multi-threading environment the order of this summation is not guaranteed; when rerunning the exact computation on exactly the same hardware, the summation could instead be performed as

v = (((v_t2 + v_t3) + v_t4) + v_t1) + ...

floating-point numbers have limited precision, and at finite precision (a+b)+c can differ from a+(b+c) (i.e., floating-point addition is not associative); therefore, you should expect slightly different results.
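To make this concrete, here is a small self-contained C example (the values are illustrative, not taken from MCX) showing that merely regrouping the same three addends changes the single-precision result:

/* demonstrates that floating-point addition is not associative */
#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float left  = (a + b) + c;  /* (1e8 - 1e8) + 1 = 1 */
    float right = a + (b + c);  /* -1e8 + 1 rounds back to -1e8, so the sum is 0 */

    printf("(a+b)+c = %g\n", left);   /* prints 1 */
    printf("a+(b+c) = %g\n", right);  /* prints 0 */
    return 0;
}

The same mechanism applies to the voxel accumulators: a different thread execution order is simply a different grouping of the same additions.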
the only case where such a result is exactly reproducible is when the thread scheduler executes threads in a deterministic logic/order, which cannot be guaranteed by most multi-threading environments. In general, CUDA's runtime has exhibited a relatively stable thread-scheduling scheme, and in some cases you can see nearly reproducible results; however, exact reproducibility is not guaranteed.
there have been numerous discussions on this topic; making results bit-to-bit reproducible even on the same hardware is quite difficult and requires manual handling of threads - __syncthreads() will not be enough, because it only asks all threads to stop at the same point but does not guarantee their order (see the sketch after the reference list below).
http://www.shodor.org/media/content//petascale/materials/UPModules/dynamicProgrammingPartI/bestPractices.pdf
(see Section 7.2.2)
https://annals-csis.org/Volume_5/pliks/86.pdf
https://hal.science/hal-00949355/document
https://www.sciencedirect.com/science/article/abs/pii/S0167819115001155
https://www-pequan.lip6.fr/~jezequel/ARTICLES/article_CANA2015.pdf
https://www.nist.gov/system/files/documents/itl/ssd/is/NRE-2015-07-Nguyen_slides.pdf
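One standard workaround (sketched below with pthreads for illustration; this is an assumption about how one could do it, not code from MCX) is to give each thread a private accumulator and then perform the final reduction serially in a fixed index order, so the summation order no longer depends on the scheduler:

/* sketch: per-thread private accumulators + fixed-order serial reduction */
#include <stdio.h>
#include <pthread.h>

#define NTHREAD 4
#define NSTEP   1000000

static float partial[NTHREAD];          /* one private slot per thread */

static void *worker(void *arg) {
    int tid = *(int *)arg;
    float sum = 0.0f;
    for (int i = 0; i < NSTEP; i++)     /* stand-in for per-packet deposits */
        sum += 1.0f / (float)(tid + i + 1);
    partial[tid] = sum;                 /* no shared accumulator, no races */
    return NULL;
}

int main(void) {
    pthread_t th[NTHREAD];
    int ids[NTHREAD];

    for (int i = 0; i < NTHREAD; i++) {
        ids[i] = i;
        pthread_create(&th[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREAD; i++)
        pthread_join(th[i], NULL);

    float v = 0.0f;                     /* always ((p0 + p1) + p2) + p3, every run */
    for (int i = 0; i < NTHREAD; i++)
        v += partial[i];

    printf("v = %.9g\n", v);            /* bit-identical across reruns */
    return 0;
}

The trade-off is extra memory (one slot per thread per output bin), which is often impractical for volumetric MC outputs - hence run-to-run floating-point jitter is usually accepted instead.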
Hi Dr. Fang,
Thank you for the prompt response! I used "floating point summation error" to mean "the accumulation of floating point numbers from multi-threads". I apologize; I should have described it better.
> because accumulation of floating point numbers from multi-threads may produce different outputs
Best,
Seonyeong