Different results

Ali Fatoorechi

Jun 9, 2021, 11:22:27 AM
to Accelerad Users
Hi Guys,

This is my first time trying Accelerad. I am getting different (consistently lower) results from Accelerad_rtrace compared to the CPU version.

Here is the link to the model. Please let me know if you have any problems downloading the files. The model is a single room with a glazing.

And here are the parameters I use:

rtrace -w  -h  -I -ab 10 -ad 40000 -as 60 -aa 0 -lw 0.000025 model.oct < points.txt > results.txt

If I run Accelerad_rtrace with -g- (not running on the GPU), it gives identical results to the original rtrace version. But the GPU results are different (always lower). I tried removing the glazing, but the results were still lower compared to the CPU version.

Any idea?


Best,
Ali

Nathaniel Jones

Jun 9, 2021, 9:47:06 PM
to Accelerad Users
Hi Ali,

There is an issue with your settings. Currently, you have lw=1/ad, which is correct when aa>0, but because you are turning off ambient caching by setting aa=0, you need a smaller lw. Ideally, you could use lw=1/ad^2, but this could be excessive given your settings. I set lw=1e-8 and got the same results with both Accelerad and the CPU version.
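To make the relationship between those -lw choices concrete, here is a quick arithmetic check (plain Python with the values from this thread, not Radiance code):

```python
# Quick check of the -lw guidance above (plain arithmetic, not Radiance code).
ad = 40000

lw_with_cache = 1 / ad          # lw = 1/ad is fine when ambient caching is on (aa > 0)
lw_ideal_no_cache = 1 / ad**2   # ideal with aa=0, but may be excessively small
lw_used = 1e-8                  # value that matched CPU and GPU results here

print(lw_with_cache, lw_ideal_no_cache, lw_used)
```

The practical value of 1e-8 sits between the two theoretical bounds: small enough to avoid the single-sample undersampling described below, without paying for the full 1/ad² ray count.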

Incidentally, what's happening is that due to your lw setting, Radiance is sending only a single diffuse sampling ray from each secondary bounce. The ray directions are pseudorandom, so on average they sample the entire scene and produce a pretty accurate result. Because Accelerad computes rays in parallel, its pseudorandom sequence is a little less random. (This is on purpose, in order to maintain coherence among rays.) Unfortunately, this doesn't work well when only a single sample is used to calculate the diffuse lighting at each point, and in your model this happens to result in undersampling the window.

Nathaniel

Ali Fatoorechi

Jun 10, 2021, 7:18:32 AM
to Accelerad Users
Hi Nathaniel,

Thank you, that sorted out the discrepancy between the GPU and CPU results.

However, with these parameters it takes ~150 s for 288 points (compared to ~35 s with the CPU version) on this relatively simple model; below is the log for the GPU version.
I wonder if I am missing a hint or parameter that would speed it up.

accelerad_rtrace -I -ab 10 -ad 40000 -as 60 -aa 0 -lw 0.00000001 model.oct < points.txt > results.txt
accelerad_rtrace: OptiX 6.8.0 found display driver 466.27, CUDA driver 11.3.0, and 1 GPU device:
accelerad_rtrace: Device 0: NVIDIA GeForce GTX 1060 6GB with 10 multiprocessors, 1024 threads per block, 1835000 Hz, 6442450944 bytes global memory, 1048576 hardware textures, compute capability 6.1, timeout enabled, Tesla compute cluster driver disabled, PCI 0000:01:00.0.

accelerad_rtrace: Geometry build time: 35 milliseconds for 108 objects.
accelerad_rtrace: OptiX kernel 0 time: 150401 milliseconds (150 seconds).
accelerad_rtrace: ray tracing time: 150620 milliseconds (150 seconds).
 
Thank you,
Ali

Ali Fatoorechi

Jun 10, 2021, 9:24:28 AM
to Accelerad Users
Just tried it again with ~18k calc points on a more complex model, but it didn't finish within an hour, so I cancelled the task.

Nathaniel Jones

Jun 10, 2021, 9:29:25 AM
to Accelerad Users
Hi Ali,

Typically, you need at least a few thousand sensor points in order to see a speedup from Accelerad. Your model with 288 points isn't going to benefit much from GPU parallelism.

There are a few settings you could change to make Accelerad perform better relative to the CPU computation. Setting lr=10 instead of the default lr=-10 will turn off Russian roulette, which will increase coherence among the GPU threads but also slow the CPU calculation tremendously. Other strategies would be to enable irradiance caching by setting aa=0.1, reduce the number of ambient bounces, or reduce the ambient divisions; all of these are currently set to much higher accuracy than you probably need for your model.

Nathaniel

Ali Fatoorechi

Jun 10, 2021, 9:50:12 AM
to Accelerad Users
Sorry, I am just a bit curious: even for a single point with -ad 40000, wouldn't there be enough rays to trace that the GPU threads (assuming 2048 threads per SM) are not left idle?

Nathaniel Jones

Jun 10, 2021, 10:16:04 AM
to Accelerad Users
It's not. Parallelism is carried out at the outermost level, which is to say one thread per primary ray. So if you have 2048 threads per SM, then you would need 2048 primary rays to fill them, and each thread would trace all 40000 diffuse rays sequentially.
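A back-of-the-envelope sketch of that occupancy argument (the per-SM thread count is the figure assumed in the question above; the SM count comes from the GTX 1060 log earlier in the thread):

```python
# Occupancy sketch: one GPU thread per primary ray when aa=0.
threads_per_sm = 2048          # figure assumed in the question above
num_sms = 10                   # GTX 1060 log reports 10 multiprocessors
gpu_threads = threads_per_sm * num_sms   # total threads to fill

primary_rays = 288             # one thread per sensor point with -I
diffuse_per_thread = 40000     # -ad rays, traced *sequentially* per thread

occupancy = primary_rays / gpu_threads
print(f"occupancy: {occupancy:.1%}")   # most threads sit idle
```

With 288 sensors the GPU runs at under 2% thread occupancy, even though the total ray count (288 × 40000) is large; the work is serialized inside each thread rather than spread across them.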

Things change if you enable irradiance caching by setting aa>0. If you do this, then the cache will be created before the primary rays are traced, and it will be created in parallel at each level, so a single point with 40000 diffuse rays would indeed fill a 2048-thread SM several times over. For details, see this paper.

However, if you only have a single point, chances are the computation would finish on the CPU faster than the program and geometry could be loaded onto the GPU, so there would be no point in doing so.

Nathaniel

Ali Fatoorechi

Jun 10, 2021, 11:58:56 AM
to Accelerad Users
With -aa 0, I just ran ~18000 calc points (on a model with ~12k triangles) on an RTX 2060, and it finished in 855 seconds. Would you expect a time similar to this?

"... one thread per primary ray... and each thread would trace all 40000 diffuse rays sequentially."
So each thread processes a single calc point? Would it not be possible to instead distribute the 40k rays (or 40k × calc points) across all the threads, that is, one ray per thread? That way you wouldn't end up with idle threads doing nothing.

Nathaniel Jones

Jun 10, 2021, 12:33:24 PM
to Accelerad Users
Not knowing the specifics of your model or settings, I won't comment on how long the calculation should take. However, the settings you showed previously indicated a higher level of accuracy than is typical, so you should probably do some sensitivity analysis to determine what settings are reasonable.

Sure, you could distribute your 40k diffuse rays across multiple threads by introducing a small positive aa value. This also requires that you set an appropriate ac value.

If you keep aa=0, then each thread processes a single line of input in rtrace, or a single pixel in rpict. You could specify the individual rays you want to trace from each sensor point as input to rtrace, and then manually combine them, similar to the -c option in rcontrib and rfluxmtx. This can be automated using cnt and rcalc to generate a random set of 40k rays per point using Shirley-Chiu sampling as implemented in disk2square.cal, then running rtrace without the -I option, and reducing the results to the original number of sensor points using total.
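As a rough prototype of that cnt/rcalc/total workflow in Python (the file name rays.txt is a stand-in, and the simple sqrt-based cosine sampler below is a simplified substitute for the Shirley-Chiu mapping in disk2square.cal):

```python
import math, random

def sample_hemisphere(n, seed=0):
    """Cosine-weighted directions about +Z (sqrt trick; a simple
    stand-in for the Shirley-Chiu mapping in disk2square.cal)."""
    rng = random.Random(seed)
    dirs = []
    for _ in range(n):
        r = math.sqrt(rng.random())           # disk radius, cosine-weighted
        phi = 2 * math.pi * rng.random()
        x, y = r * math.cos(phi), r * math.sin(phi)
        dirs.append((x, y, math.sqrt(max(0.0, 1 - x * x - y * y))))
    return dirs

def write_rays(sensors, n_per_sensor, path="rays.txt"):
    """Expand each sensor point into n_per_sensor rays for rtrace without -I
    (the role cnt + rcalc play in the pipeline described above)."""
    with open(path, "w") as f:
        for (px, py, pz) in sensors:
            for (dx, dy, dz) in sample_hemisphere(n_per_sensor):
                f.write(f"{px} {py} {pz} {dx} {dy} {dz}\n")

def aggregate(values, n_per_sensor):
    """Reduce per-ray rtrace outputs back to one value per sensor
    (the role total plays in the pipeline)."""
    return [sum(values[i:i + n_per_sensor]) / n_per_sensor
            for i in range(0, len(values), n_per_sensor)]
```

Here write_rays generates the input for an rtrace call like the ones in this thread, and aggregate averages each block of per-ray results back to its sensor; directions are about +Z, so a real version would rotate them to each sensor's normal.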

Nathaniel

Ali Fatoorechi

Jun 11, 2021, 4:56:56 AM
to Accelerad Users
Thanks for clarifying it,

And I guess this is true for the rfluxmtx program, where ambient caching is off (-aa 0) by default? And it processes a single line of input (sensor point)?

Nathaniel Jones

Jun 11, 2021, 7:13:37 AM
to Accelerad Users
The rfluxmtx program is a wrapper for rcontrib, and its job is to create the input for rcontrib. By default, this input is 10k lines when a sender file is used, resulting in 10k threads when run on the GPU. You can change this using the -c option. If no sender file is given, then it is up to the user to specify the number of primary rays through the standard input.

Basically, rfluxmtx is an automation tool for rcontrib which is capable of generating a large number of rays to be run in parallel. The point I made before was that rtrace has no equivalent automation tool, but you can make one using cnt, rcalc, and total.

Nathaniel

Ali Fatoorechi

Jun 11, 2021, 8:17:34 AM
to Accelerad Users
OK, that is a good point.
I guess normally for illuminance calculations the sender is left empty and only '-' is used. Or at least that's how I understood it from the main tutorials on climate-based calcs.
So, again, by default rfluxmtx processes a single sensor point at a time and doesn't utilize the full GPU resources, is that correct?

I am not sure about the internals of rfluxmtx and rtrace, but my point is that in general you would get a massive increase in speed if you instead distributed all the rays and sent them to the GPU. I have had both approaches implemented in CUDA before, and the difference is huge, at least for a blind Monte Carlo method.

Nathaniel Jones

Jun 11, 2021, 8:56:58 AM
to Accelerad Users
The tutorials show a simplified case. In reality, you would most likely provide a list of sensor locations to rfluxmtx when using '-', and not just a single point. If the list is small, I agree there is not much to be gained by using a GPU.

You certainly can distribute rays across all of your GPU cores. I've mentioned two ways to do it at this point. However, the implementation is up to you.

Nathaniel



Ali Fatoorechi

Jun 14, 2021, 8:00:11 AM
to Accelerad Users
Hi Nathaniel,

Thanks for your reply. I tried your suggestion to send sampled rays instead and aggregate the ray results afterward.

So I removed '-I' from the command args:

rtrace -ab 10 -ad 40000 -as 60 -aa 0 -lw 0.00000001 model.oct < rays.txt > results.txt

and used rays.txt as the input file, which has 40114 rays uniformly sampled for a single point, in xorg yorg zorg xdir ydir zdir format:

25.120000 24.380000 1.250000 0.038939 0.000568 0.999241
25.120000 24.380000 1.250000 0.049741 0.001246 0.998761
25.120000 24.380000 1.250000 0.042350 0.001580 0.999102
...

However, it actually took much longer to finish (85 s on my PC). So I guess I used a wrong parameter or didn't do it correctly. Could you shed some light? Thank you.
The files are available in this link if needed.

Nathaniel Jones

Jun 14, 2021, 10:19:39 AM
to Accelerad Users
Hi Ali,

Yes, there are some other settings you need to change. Otherwise, you are just increasing the number of primary rays from 288 in your previous example to 40114 now.

Essentially, this approach removes the first bounce at the virtual sensor plane, so the other simulation parameters should change to correspond to values for the second bounce. So, if you want to match the accuracy of your earlier test:
  • ab_new = ab_old - 1
  • ad_new = ad_old/2
  • lw_new = min(lw_old*ad_old, 1)
  • as_new = as_old*lw_old/lw_new (rounded down, which gives 0 here)
  • lr_new = (abs(lr_old) - 1)*sign(lr_old)
So your new parameters become:

rtrace -ab 9 -ad 20000 -as 0 -aa 0 -lr -9 -lw 4e-4 model.oct < rays.txt > results.txt
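A quick arithmetic check that those substitutions reproduce the command line above (the -as term is left out here; with these lw values it rounds down to the 0 shown):

```python
# Verify the parameter substitutions against the rtrace command above.
ab_old, ad_old, lw_old, lr_old = 10, 40000, 1e-8, -10  # lr defaults to -10

ab_new = ab_old - 1                                      # -ab 9
ad_new = ad_old // 2                                     # -ad 20000
lw_new = min(lw_old * ad_old, 1)                         # -lw ~4e-4
lr_new = (abs(lr_old) - 1) * (-1 if lr_old < 0 else 1)   # -lr -9

print(ab_new, ad_new, lw_new, lr_new)
```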

To clarify: you said you were sampling uniformly at a single point. You want your samples to be cosine-weighted, not uniform. Again, I suggest Shirley-Chiu sampling, which is what Radiance does, and which is helpfully provided for you in disk2square.cal.
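For reference, the Shirley-Chiu concentric mapping is compact; here is a sketch of the same idea in Python (function and variable names are my own, not those in disk2square.cal):

```python
import math

def concentric_disk(u, v):
    """Shirley-Chiu concentric mapping: unit square -> unit disk,
    preserving stratification with low distortion."""
    a, b = 2 * u - 1, 2 * v - 1
    if a == 0 and b == 0:
        return 0.0, 0.0
    if abs(a) > abs(b):
        r, phi = a, (math.pi / 4) * (b / a)
    else:
        r, phi = b, (math.pi / 2) - (math.pi / 4) * (a / b)
    return r * math.cos(phi), r * math.sin(phi)

def cosine_direction(u, v):
    """Lift the disk point onto the hemisphere about +Z, yielding a
    cosine-weighted sampling direction (Malley's method)."""
    x, y = concentric_disk(u, v)
    return x, y, math.sqrt(max(0.0, 1 - x * x - y * y))
```

Feeding a stratified grid of (u, v) pairs through cosine_direction gives the kind of well-distributed, cosine-weighted ray set that the -I substitution above expects.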

Nathaniel

Ali Fatoorechi

Jun 16, 2021, 5:08:18 AM
to Accelerad Users
Hi Nathaniel,

You are right; I misspoke when I said the rays were sampled uniformly. They are actually cosine-weighted in the file I sent.

I tried rtrace with the new params, but it didn't increase the speed.

So here are the results for 3 scenarios run on a GeForce 1060:

1. rtrace on the CPU (6 cores) with 288 input sensors finished in ~30 s:
rtrace -I -ab 10 -ad 40000 -as 60 -aa 0 -lw 0.000025 model.oct < points.txt > results.txt

2. rtrace on the GPU with 288 input sensors finished in 154 s:
rtrace -I -ab 10 -ad 40000 -as 60 -aa 0 -lw 0.00000001 model.oct < points.txt > results.txt

3. rtrace on the GPU with 800000 rays (20 sensors × 40000 rays) finished in 108 s:
rtrace -ab 9 -ad 20000 -as 0 -aa 0 -lr -9 -lw 4e-4 model.oct < rays.txt > results.txt

I only sent rays for 20 sensors in the 3rd scenario; otherwise it throws an out-of-memory exception. So I think I would have to send the rays in a few batches. Nevertheless, it is much slower compared to scenario 2.
Any suggestions?

Also, on a bigger model (~12k triangles) with ~18k sensors, running rtrace with the params from the 1st scenario, the CPU (6 cores) finished in 386 s compared to 3860 s for the GPU version; 10x slower.
Looking at your papers, I tend to conclude that you don't gain much from the GPU with -aa 0? Or at least the results I get don't suggest that.
I also ran the same tests on a graphics card with RT cores (RTX 2080), and the time was then comparable to the CPU version (again with 6 cores). So it seems non-RT cards hurt the performance a lot.

Nathaniel Jones

Jun 16, 2021, 10:23:17 PM
to Accelerad Users
Hi Ali,

Thanks for providing the timings. I'm a little confused by the comparison you're making, though. It looks as if you're using different parameters for the CPU and GPU. You have lw=2.5e-5 for the CPU, and 1e-8 for the GPU. So the GPU takes 5 times longer, but it is tracing 20000 times more rays? For a fair comparison, you should use the same settings for both. However, if you're getting the same results on the CPU with both lw settings, then perhaps you could reduce some other parameter settings such as -ad and -ab and still maintain accuracy while speeding up the simulation.

I'm also a little unclear about the environment you're using for your CPU comparison. You mentioned that you are using 6 cores, but I don't see where you have set -n.

The tests that I've published generally make comparisons to models that take minutes or hours to run on CPUs. Your case takes a few seconds to run on the CPU, so it's not the type of simulation that I've put much effort into speeding up. However, for timings on cases with -aa 0, you could take a look at this paper.

Nathaniel

Ali Fatoorechi

Jun 17, 2021, 6:32:20 AM
to Accelerad Users
Hi Nathaniel,

Thank you for the reply,

Yes, I used different lw values for the CPU and GPU cases, and that is based on our earlier conversation. I use lw=1/ad for the CPU, but this doesn't give the same results in the GPU version (always lower). So, to get the same results, you suggested trying 1e-8, for the reasons that you explained earlier. I obviously don't want to sacrifice accuracy for speed if it is not giving accurate results.

I work on the Windows platform, and -n doesn't work on Windows, so I split and distribute the sensors between multiple cores.

Yes, 288 sensors aren't a lot, but I would at least expect the GPU to show a similar time. And surely 18k sensors should be enough to see a speedup from the GPU, but as I said, this was 10x slower than running on the CPU.

Nathaniel Jones

Jun 17, 2021, 8:55:09 AM
to Accelerad Users
Hi Ali,

I might have been unclear in my first response. When you use -aa 0, Radiance behaves differently than with positive -aa settings, and the lw=1/ad rule doesn't apply in this case. So you should be using -lw 1e-8 for both CPU and GPU calculations, and use the CPU timing with -lw 1e-8 for comparison against Accelerad.

Nathaniel
