A GPU test in Win 1

Feng Zhou

Oct 24, 2017, 9:48:07 AM

to gprMax-users

After two weeks of effort, I finally succeeded in running gprMax on a GPU under Windows 10.

First, I would like to share my experience of installing the software needed for gprMax GPU computing. Installing the GPU software seems to be much easier on Linux than on Windows because of its gcc compiler, but I prefer Windows because I am not familiar with Linux.

My first step was to install Visual Studio. I tried both Visual Studio 2017 and 2015, and I recommend Visual Studio 2015 to make sure the CUDA Toolkit recognizes it. I used vs2015.ent_enu.iso and chose to install all components. This takes a lot of time and disk space, but it is the safe choice for a programming beginner.

Then I installed the CUDA Toolkit as described in the gprMax documentation. I chose cuda_9.0.176_win10.exe and kept all the default installation options. The environment variable settings are very important, but they will be fine if you choose the default installation.

Afterwards, I installed Miniconda and gprMax and ran a test of gprMax, following the gprMax manual. I had some problems at this step because my internet connection is very unstable, which may not be a problem for people in other countries. The next step was to install pycuda according to the guide.

With everything installed, I began to run gprMax on the GPU. As discussed on GitHub, I also ran into the error “The context stack was not empty upon module cleanup”. The main cause was that the directory containing cl.exe had not been included in the PATH environment variable. I manually added the cl.exe directory to my user PATH, and after that everything worked. If you run into problems, I strongly suggest checking the environment variable settings. With so many pieces of software installed, they can easily fail to find each other.
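For anyone hitting the same cl.exe problem, a quick way to check the PATH before running gprMax might look like the sketch below. It only uses the Python standard library. The helper name find_build_tools is made up for this illustration, and checking for nvcc as well as cl is my assumption about what is worth checking, not something taken from the gprMax guide.

```python
import shutil


def find_build_tools(tools=("cl", "nvcc")):
    """Return {tool: full path or None} for each tool looked up on PATH.

    On Windows, shutil.which also tries the extensions listed in PATHEXT,
    so "cl" will match cl.exe if its directory is on PATH.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Hypothetical helper for illustration; not part of gprMax or pycuda.
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If cl is reported as not found, adding the Visual Studio directory that contains cl.exe to the user PATH, as described above, is the thing to try first.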

Then I ran a series of tests on my workstation, which is configured as follows:

Dell T7810, one Intel Xeon E5-2643 v4 CPU @ 3.4 GHz (6 cores, 12 threads), 32 GB memory, one NVIDIA Quadro K1200 with 4 GB memory and 512 CUDA cores. My test results are as follows:

(1) cylinder_Ascan_2D.in, cell number: 120*120*1

CPU: 41.9 M memory, solver time 0.58 s, simulation time 1.45 s

GPU: 109 M memory, solver time 0.36 s, simulation time 3.57 s

(2) cylinder_Ascan_2D.in, increased cell number: 5000*5000*1 = 2.5e7

CPU: 2.59 G memory, solver time 1 m 22 s, simulation time 1 m 27 s

GPU: 2.84 G memory, solver time 28 s, simulation time 48 s

(3) heterogeneous_soil.in, cell number: 150*150*100 = 2.25e6

CPU: 307 M memory, solver time 2 m 47 s, simulation time 3 m 3 s

GPU: 375 M memory, solver time 35.1 s, simulation time 50.55 s

(4) heterogeneous_soil.in, increased cell number: 300*300*500 = 4.5e7

CPU: runs successfully

GPU: warning that the required memory exceeds the 4 GB of GPU memory

I think the GPU mainly reduces the solver time, but it cannot run a large-scale model whose memory requirement exceeds the GPU memory. Generally, a large-scale model is the one that most needs the GPU to speed up the solver, yet its memory consumption limits the use of the GPU.
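To put numbers on this, the speedups for tests (1) to (3) can be worked out from the timings reported above. The snippet below is only a small sketch of that arithmetic. The timings are copied from the results, converted to seconds, and the variable and function names are made up for illustration.

```python
# Timings from tests (1)-(3) above, in seconds:
# (cpu_solver, gpu_solver, cpu_total, gpu_total)
timings = {
    "(1) cylinder_Ascan_2D, 120*120*1": (0.58, 0.36, 1.45, 3.57),
    "(2) cylinder_Ascan_2D, 5000*5000*1": (82.0, 28.0, 87.0, 48.0),
    "(3) heterogeneous_soil, 150*150*100": (167.0, 35.1, 183.0, 50.55),
}


def speedups(cpu_solver, gpu_solver, cpu_total, gpu_total):
    """Return (solver speedup, total simulation speedup) as CPU time / GPU time."""
    return cpu_solver / gpu_solver, cpu_total / gpu_total


if __name__ == "__main__":
    for name, t in timings.items():
        solver, total = speedups(*t)
        print(f"{name}: solver {solver:.2f}x, total {total:.2f}x")
```

The solver speedup grows from roughly 1.6x for the smallest model to roughly 4.8x for test (3). For the smallest model the total simulation time is actually longer on the GPU (a ratio of about 0.4), so the GPU setup overhead dominates there.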

Mert Su

Oct 27, 2017, 10:30:56 AM

to gprMax-users

Zhou,

Can you try a 3D example "Antenna 'like' Mala 1200" and report the results?

I am on Ubuntu 16.04, Intel i5-6660 4 cores at 3.3 GHz, 16 GB RAM, one Quadro K1200 4 GB.

This model consumes 1.2 GB of GPU RAM, and I see a steep decline in the number of iterations per second on the GPU. It starts at 80 its/sec, drops to 20 its/sec within about 5 seconds, and stabilizes around there.

I am wondering if you will see similar results.

Regards,

Mert

Craig Warren

Oct 27, 2017, 11:02:27 AM

to gprMax-users

We have a paper, currently in review, that gives more details of the GPU performance of gprMax. For now, take a look at the graph on the News page of http://www.gprmax.com

It shows that GPU performance throughput increases with model size, reaching a plateau at ~300^3 cells and larger. The plateau occurs because the arrays that the GPU kernels operate on become large enough to saturate the memory bandwidth of the GPU. The instantaneous its/sec counter is not really a reliable indicator of overall performance. A better measure is performance throughput in Mcells/s, e.g. P = (NX * NY * NZ * NT) / (T * 1e6), where NX, NY, NZ are the number of cells in each direction, NT is the number of time-steps, and T is the runtime of the simulation in seconds.
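As a minimal sketch, the throughput formula can be written as a small Python function. The function name and the example numbers (a 300^3 model, 1000 time-steps, 60 s runtime) are hypothetical and chosen only to illustrate the arithmetic. They are not measured results.

```python
def throughput_mcells_per_s(nx, ny, nz, nt, runtime_s):
    """Performance throughput P = (NX * NY * NZ * NT) / (T * 1e6) in Mcells/s."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return (nx * ny * nz * nt) / (runtime_s * 1e6)


if __name__ == "__main__":
    # Hypothetical example values, not a real benchmark.
    p = throughput_mcells_per_s(300, 300, 300, 1000, 60.0)
    print(f"{p:.1f} Mcells/s")
```

Because P is normalized by both model size and number of time-steps, it allows runs of different sizes and on different hardware to be compared, which the instantaneous its/sec counter does not.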

Kind regards,

Craig

Mert Su

Nov 2, 2017, 3:04:03 PM

to gprMax-users

So, in essence, a GPU with 560 GB/s of memory bandwidth solves faster than one with 80 GB/s, if I understood correctly.
