The same code gets different results on different generations of GPU

Pei Li

Aug 7, 2016, 7:26:18 PM
to Numba Public Discussion - Public
Hi, has anyone experienced the same or a similar problem, where the same cuda.jit functions give different results on different generations of GPUs? I finished the code for a 2D lattice Boltzmann method, which uses cuda.jit to accelerate the computation. The good news is that it reaches more than a 100x speed-up, which will save me a lot of time in the future. On the other hand, there is a weird and bad phenomenon. Currently, the code works very well on the Fermi and Kepler architectures (my laptop has a GeForce 610M and the desktop in the office has a Quadro K420). When I run the code on the workstation, which has a GTX 950 (2 GB memory, Maxwell architecture), some grid values are always updated incorrectly. At first I guessed it might be caused by a race between threads, because 'if' is used there. However, adding cuda.syncthreads() still did not help. Does anyone know whether this can be caused by Numba's support for the Maxwell architecture, or by other factors? The piece of code where the problem occurs is below:

    if (isFluid[indicesK] == True and indicesK < totalNum):
        fluidDensity = distrFIn[0] + distrFIn[1] + distrFIn[2] + distrFIn[3] + \
            distrFIn[4] + distrFIn[5] + distrFIn[6] + distrFIn[7] + distrFIn[8]
        squareV = 1.5 * (equilibriumV[0] * equilibriumV[0] + equilibriumV[1] * \
                equilibriumV[1])
        fEq0 = 4./9. * fluidDensity * (1. - squareV)
        fluidDensity *= 1./9.
        fEq1 = fluidDensity * (1. + 3.0 * equilibriumV[0] + 4.5 * equilibriumV[0] * \
                equilibriumV[0] - squareV)
        fEq3 = fEq1 - 6.0 * equilibriumV[0] * fluidDensity
        fEq2 = fluidDensity * (1. + 3.0 * equilibriumV[1] + 4.5 * equilibriumV[1] * \
                equilibriumV[1] - squareV)
        fEq4 = fEq2 - 6.0 * equilibriumV[1] * fluidDensity
        fluidDensity *= 1./4.
        fEq5 = fluidDensity * (1. + 3.0 * (equilibriumV[0] + equilibriumV[1]) + \
                4.5 * (equilibriumV[0] + equilibriumV[1]) * (equilibriumV[0] + \
                equilibriumV[1]) - squareV)
        fEq6 = fluidDensity * (1. + 3.0 * (-equilibriumV[0] + equilibriumV[1]) + \
                4.5 * (-equilibriumV[0] + equilibriumV[1]) * (-equilibriumV[0] + \
                equilibriumV[1]) - squareV)
        fEq7 = fluidDensity * (1. + 3.0 * (-equilibriumV[0] - equilibriumV[1]) + \
                4.5 * (-equilibriumV[0] - equilibriumV[1]) * (-equilibriumV[0] - \
                equilibriumV[1]) - squareV)
        fEq8 = fluidDensity * (1. + 3.0 * (equilibriumV[0] - equilibriumV[1]) + \
                4.5 * (equilibriumV[0] - equilibriumV[1]) * (equilibriumV[0] - \
                equilibriumV[1]) - squareV)
        distrFIn[0] += (fEq0 - distrFIn[0]) / tau
        distrFIn[1] += (fEq1 - distrFIn[1]) / tau
        distrFIn[2] += (fEq2 - distrFIn[2]) / tau
        distrFIn[3] += (fEq3 - distrFIn[3]) / tau
        distrFIn[4] += (fEq4 - distrFIn[4]) / tau
        distrFIn[5] += (fEq5 - distrFIn[5]) / tau
        distrFIn[6] += (fEq6 - distrFIn[6]) / tau
        distrFIn[7] += (fEq7 - distrFIn[7]) / tau
        distrFIn[8] += (fEq8 - distrFIn[8]) / tau
                 
When I changed the 'if' condition so that this computation runs for all the grid points in the domain, the problem just disappeared... But my simulation cannot assume that the simulated domain is entirely filled with fluid.

Thanks very much,

Pei

Stanley Seibert

Aug 8, 2016, 9:58:37 AM
to Numba Public Discussion - Public
This still sounds like a race condition, but it is hard to tell from this code fragment what the problem could be.  We're not aware of any Maxwell-specific compiler issues.

syncthreads is primarily used to manage shared memory to avoid race conditions.  Which of these arrays are stored in shared memory?
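
For reference, a minimal sketch of the shared-memory/syncthreads pattern (the kernel and array names here are illustrative only, not taken from your code): each thread stages a value in shared memory, every thread passes the barrier, and only then does any thread read a slot written by another thread.

from numba import cuda, float64

@cuda.jit
def neighbour_sum(inp, out):
    # Stage one element per thread in block-shared memory.
    tile = cuda.shared.array(shape=128, dtype=float64)
    tx = cuda.threadIdx.x
    i = cuda.grid(1)
    if i < inp.size:
        tile[tx] = inp[i]
    else:
        tile[tx] = 0.0
    # Barrier: after this point every thread of the block can safely read
    # slots written by the other threads of the same block.
    cuda.syncthreads()
    if i < inp.size and 0 < tx < cuda.blockDim.x - 1:
        out[i] = tile[tx - 1] + tile[tx] + tile[tx + 1]
# launched e.g. as neighbour_sum[blocks, 128](d_in, d_out)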


Pei Li

Aug 8, 2016, 6:54:04 PM
to numba...@continuum.io
Thank you very much, Stanley. Indeed, this problem really confuses me. I also guessed it could be caused by a race condition, but why would it disappear when the code is run on the other two graphics cards? Sorry for not providing the complete code. The whole function is below, so you can see which arrays are located in shared memory.

from numba import cuda, float64

@cuda.jit('void(int64, int64, int64, float64, float64[:], float64[:], float64[:, :], \
            float64[:, :], boolean[:], boolean[:])')
def calCollisionStreamingGPU(nx, ny, numberThreads, tau, equilibriumVX, equilibriumVY, \
            fluidDistrOld, fluidDistrNew, isDomain, isSolid):
    totalNum = nx * ny
    tx = cuda.threadIdx.x; bx = cuda.blockIdx.x; by = cuda.blockIdx.y
    xStart = tx + bx * numberThreads
    yStart = by
    indicesK = nx * yStart + xStart
    #allocate shared memory
    distrFOut0 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut1 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut2 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut3 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut4 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut5 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut6 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut7 = cuda.shared.array(shape = (32,), dtype=float64)
    distrFOut8 = cuda.shared.array(shape = (32,), dtype=float64)

    distrFIn = cuda.local.array(shape = (9,), dtype=float64)
    distrFIn[0] = fluidDistrOld[0, indicesK]
    distrFIn[1] = fluidDistrOld[1, indicesK]
    distrFIn[2] = fluidDistrOld[2, indicesK]
    distrFIn[3] = fluidDistrOld[3, indicesK]
    distrFIn[4] = fluidDistrOld[4, indicesK]
    distrFIn[5] = fluidDistrOld[5, indicesK]
    distrFIn[6] = fluidDistrOld[6, indicesK]
    distrFIn[7] = fluidDistrOld[7, indicesK]
    distrFIn[8] = fluidDistrOld[8, indicesK]
    
    equilibriumV = cuda.local.array(shape = (2,), dtype=float64)
    equilibriumV[0] = equilibriumVX[indicesK]
    equilibriumV[1] = equilibriumVY[indicesK]
    
    if (isFluid[indicesK] == True and indicesK < totalNum):
        # collision step for the fluid nodes: the fEq update quoted in the
        # first message goes here (omitted in this excerpt)
        pass
#    if (xStart < nx and yStart < ny):
    if (isSolid[indicesK] == True and indicesK < totalNum):
        if (xStart < nx and yStart < ny):
            tmp = distrFIn[1]; distrFIn[1] = distrFIn[3]; distrFIn[3] = tmp
            tmp = distrFIn[2]; distrFIn[2] = distrFIn[4]; distrFIn[4] = tmp
            tmp = distrFIn[5]; distrFIn[5] = distrFIn[7]; distrFIn[7] = tmp
            tmp = distrFIn[6]; distrFIn[6] = distrFIn[8]; distrFIn[8] = tmp
        
    if (tx == 0):
        distrFOut1[tx + 1] = distrFIn[1]
        distrFOut3[numberThreads - 1] = distrFIn[3]
        distrFOut5[tx + 1] = distrFIn[5]
        distrFOut6[numberThreads - 1] = distrFIn[6]
        distrFOut7[numberThreads - 1] = distrFIn[7]
        distrFOut8[tx + 1] = distrFIn[8]
    elif (tx == numberThreads - 1):
        distrFOut1[0] = distrFIn[1]
        distrFOut3[tx - 1] = distrFIn[3]
        distrFOut5[0] = distrFIn[5]
        distrFOut6[tx - 1] = distrFIn[6]
        distrFOut7[tx - 1] = distrFIn[7]
        distrFOut8[0] = distrFIn[8]
    else:
        distrFOut1[tx + 1] = distrFIn[1]
        distrFOut3[tx - 1] = distrFIn[3]
        distrFOut5[tx + 1] = distrFIn[5]
        distrFOut6[tx - 1] = distrFIn[6]
        distrFOut7[tx - 1] = distrFIn[7]
        distrFOut8[tx + 1] = distrFIn[8]
    # synchronize
    cuda.syncthreads()
    # write to global memory
    fluidDistrNew[0, indicesK] = distrFIn[0]
    fluidDistrNew[1, indicesK] = distrFOut1[tx]
    fluidDistrNew[3, indicesK] = distrFOut3[tx]
        
    if (by < ny - 1):
        indicesK = nx * (yStart + 1) + xStart
        if (xStart < nx and yStart < ny):
            fluidDistrNew[2, indicesK] = distrFIn[2]
            fluidDistrNew[5, indicesK] = distrFOut5[tx]
            fluidDistrNew[6, indicesK] = distrFOut6[tx]
    
    if (by > 0):
        indicesK = nx * (yStart - 1) + xStart
        if (xStart < nx and yStart < ny):
            fluidDistrNew[4, indicesK] = distrFIn[4]
            fluidDistrNew[7, indicesK] = distrFOut7[tx]
            fluidDistrNew[8, indicesK] = distrFOut8[tx]
   

Siu Kwan Lam

Aug 9, 2016, 12:33:29 PM
to numba...@continuum.io
A race condition can be revealed by differences in instruction/warp scheduling between CUDA architectures. For example, depending on the architecture and the instruction, an instruction can be executed by an entire warp at once, split into 2 half-warps, or split into 4 quarter-warps. The latter cases make some operations execute sequentially within a warp. If there really is a race condition, having more warps would probably reveal the problem on the older architectures as well. A warp is 32 consecutive threads, so increasing the threads-per-block to 64 will give 2 warps per block. Looking at the shared memory size (32), I believe your kernel is running with one warp per block.
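
To make that concrete, here is a tiny self-contained sketch (not part of Pei's kernel) of an intra-block race that can stay hidden while a block holds a single warp but show up once it holds two: every thread writes its own shared-memory slot and then reads a neighbour's slot without a barrier in between, so whether the read sees the new value depends entirely on how the warps happen to be scheduled.

import numpy as np
from numba import cuda, float64

@cuda.jit
def shift_no_sync(out):
    buf = cuda.shared.array(shape=64, dtype=float64)
    tx = cuda.threadIdx.x
    buf[tx] = tx
    # cuda.syncthreads()   # uncommenting this barrier removes the race
    out[tx] = buf[(tx + 1) % 64]

out = np.zeros(64)
shift_no_sync[1, 64](out)   # 64 threads = 2 warps: the cross-warp reads are racy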


--
Siu Kwan Lam
Software Engineer
Continuum Analytics

Pei Li

Aug 9, 2016, 7:37:14 PM
to numba...@continuum.io
Hi, Siu,

Thanks very much for your help. Differences in instruction/warp scheduling between architectures can explain why this problem happened. I will test on different computers later and keep you updated.


Pei Li

Aug 11, 2016, 5:31:03 PM
to numba...@continuum.io
Hi, Siu and Stanley,

The problem has been solved: I divided the original collision-streaming function into two functions. I guess there were too many 'if' control statements in the original function, which created a race condition somewhere and caused the problem on the Maxwell architecture. With the complexity reduced, the simulation now runs very well on all three types of GPU.

Siu Kwan Lam

Aug 12, 2016, 10:42:46 AM
to numba...@continuum.io
Pei,

I don't think too many 'if' statements can cause a race condition. The point where you separated the function acts like a whole-GPU synchronization, which leads me to wonder whether your problem requires inter-block communication. The syncthreads barrier is only good intra-block (between threads within the same block). If the race condition happens among thread blocks on global memory, the only way to synchronize is to split the function.
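
As an illustration only (these two kernels are placeholders, not Pei's actual collision and streaming functions): kernels launched on the same stream run in order, so the end of the first launch acts as a grid-wide barrier and the second kernel sees a consistent view of global memory, even across blocks.

import numpy as np
from numba import cuda

@cuda.jit
def collision(f_old, f_new):
    i = cuda.grid(1)
    if i < f_old.size:
        f_new[i] = f_old[i]           # stand-in for the purely local collision rule

@cuda.jit
def streaming(f_new, f_next):
    i = cuda.grid(1)
    if i < f_new.size:
        # Reads a value that may have been written by a different block,
        # so it must run after the collision kernel has completed.
        f_next[i] = f_new[(i + 1) % f_new.size]

f_old = cuda.to_device(np.arange(1024, dtype=np.float64))
f_new = cuda.device_array_like(f_old)
f_next = cuda.device_array_like(f_old)
threads = 64
blocks = (f_old.size + threads - 1) // threads
collision[blocks, threads](f_old, f_new)    # first launch
streaming[blocks, threads](f_new, f_next)   # starts only after the first has finished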

--
Siu Kwan Lam
Software Engineer
Continuum Analytics


Pei Li

Aug 14, 2016, 7:31:36 AM
to numba...@continuum.io
Hi, Siu,

You're right. My simulation needs inter-block communication in the streaming process. But as I remember from debugging, the problem happened in the collision part, which does not require inter-block communication. During the collision process, the threads for the solid phase did not really follow the instructions and produced 0.0 values for some distribution functions, or the values of fluid1's distribution functions were written into those of fluid0. I wonder whether it is possible to do a whole-GPU synchronization inside the function. Maybe after doing that test I can figure out the real reason for this problem.

Thanks very much.


Siu Kwan Lam

Aug 15, 2016, 11:01:05 AM
to numba...@continuum.io
> I wonder whether it is possible to do whole GPU synchronization in the function.

There is no whole GPU sync within the same kernel (grid).  Splitting the kernel is the way.

--
Siu Kwan Lam
Software Engineer
Continuum Analytics
