Multi-GPU Parallelism with MAGMA


aran nokan

Sep 27, 2021, 11:35:23 AM
to MAGMA User
Hi,

Do we have an easy-to-understand example in MAGMA for multi-GPU programming?
How can I keep the GPUs running in parallel?
Should I launch multiple CPU threads and call set_device in each thread so that the GPUs can work in parallel?

Best regards,
Aran

Stanimire Tomov

Sep 27, 2021, 1:58:31 PM
to MAGMA User, noka...@gmail.com
Hi Aran,

Yes, that would be one possibility.
I would classify this programming approach as similar to MPI: each thread works with a
specific GPU and runs its own set of instructions, and there is some communication to
synchronize with the work of the other threads (GPUs) when needed.
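
As a rough sketch of that thread-per-GPU pattern (using OpenMP for the host threads; the function and variable names here are illustrative, not MAGMA internals), each thread could bind to its own device and drive it independently:

#include <omp.h>
#include "magma_v2.h"

// Hypothetical sketch: one host thread per GPU, MPI-like. Each thread binds
// to its device, creates its own queue, issues its own work, and synchronizes
// with the other threads when needed. (Assumes magma_init() was called.)
void per_thread_work( int ngpu )
{
    #pragma omp parallel num_threads(ngpu)
    {
        int d = omp_get_thread_num();
        magma_setdevice( d );                // bind this thread to GPU d

        magma_queue_t queue;
        magma_queue_create( d, &queue );

        // ... allocate this GPU's portion of the data and queue its work here,
        // e.g. magma_dgemm( ..., queue );

        magma_queue_sync( queue );           // wait for GPU d to finish
        #pragma omp barrier                  // synchronize with the other threads (GPUs)

        magma_queue_destroy( queue );
    }
}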

In MAGMA we take a slightly different approach. Typically our algorithms do a panel (set of columns)
factorization followed by an update of the trailing matrix. The panel factorization is typically done
on the CPU and the updates are done on the GPUs. The matrix is distributed in 1D block-cyclic fashion over
the memories of the different GPUs. You may want to do the same if you use multiple threads.
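
For illustration (my notation, not MAGMA's internal code), a 1D block-cyclic column distribution with block width nb over ngpu GPUs maps a global column index to its owner like this:

// Sketch: map global column j to its owner GPU and local column offset
// under a 1D block-cyclic distribution (block width nb, ngpu GPUs).
static void block_cyclic_owner( int j, int nb, int ngpu,
                                int *owner_gpu, int *local_col )
{
    int blk     = j / nb;                   // global block-column index
    *owner_gpu  = blk % ngpu;               // block-columns go to GPUs round-robin
    *local_col  = (blk / ngpu) * nb         // full local block-columns before it
                + (j % nb);                 // offset within the block
}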

In MAGMA a single thread queues the work that the GPUs have to do. For example, after a panel is done
on the CPU, the result gets sent to the GPUs along with the "scheduling" of the updates. The calls that the single
thread makes to do this "scheduling" are all asynchronous, so they should be equivalent to what you were thinking of,
namely dispatching the calls in parallel from one thread per GPU. It would be interesting to see if there is any difference.
We have thought about doing this before and ran some experiments, but have not found the time to investigate
it more rigorously and take it from prototype to production code as part of the library.

Thanks,
Stan

aran nokan

Sep 28, 2021, 11:31:50 AM
to Stanimire Tomov, MAGMA User
Thanks Stan,

Here I have provided a quick simple test.


I have two GEMMs and want to run them in parallel (maybe it is easier to just have one thread and multiple GPUs).

Is it enough to just add a new queue for the other device and run the second GEMM on that extra queue? It is not clear to me. Also, outside of the loop I may have other kernels running on the main queue and the main GPU.

Best regards,
Aran

aran nokan

Sep 28, 2021, 11:42:57 AM
to Stanimire Tomov, MAGMA User
Also, about allocating memory on the second device: how can I manage that with one thread?

Stanimire Tomov

Sep 28, 2021, 12:51:27 PM
to aran nokan, Stanimire Tomov, MAGMA User
Aran,

Following the notation of your example, instead of one queue you can have two, i.e., one for each device.
You set the device with magma_setdevice(device_number); and every call after that
is sent to GPU device_number, including memory allocations.

magma_queue_t queue[2];
double *dA[2], *dB[2], *dC[2];   // one set of device matrices per GPU


// allocate memory and create queues (one queue per device)
for (int d=0; d < 2; d++) {
    magma_setdevice(d);          // subsequent allocations go to GPU d
    if (MAGMA_SUCCESS != magma_dmalloc( &dA[d], m*k )) {
        *info = MAGMA_ERR_DEVICE_ALLOC;
        return *info;
    }
    if (MAGMA_SUCCESS != magma_dmalloc( &dB[d], k*n )) {
        *info = MAGMA_ERR_DEVICE_ALLOC;
        return *info;
    }
    if (MAGMA_SUCCESS != magma_dmalloc( &dC[d], m*n )) {
        *info = MAGMA_ERR_DEVICE_ALLOC;
        return *info;
    }
    magma_queue_create( d, &queue[d] );
}


// call the gemms in parallel: each call is asynchronous, queued on its own device
for (int i=0; i < 4; i++) {
    // lda is used for all leading dimensions here, assuming lda >= max(m,k)
    magma_setdevice(0);
    magma_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k, 1.0, dA[0], lda, dB[0], lda, 1.0, dC[0], lda, queue[0] );

    magma_setdevice(1);
    magma_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k, 1.0, dA[1], lda, dB[1], lda, 1.0, dC[1], lda, queue[1] );
}

Note that I changed the notation of the arrays, i.e., dA on GPU 0 is dA[0],
and dD on GPU 1 is dA[1], etc.
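
One detail worth adding (my note, not part of the reply above): the magma_dgemm calls are asynchronous with respect to the host, so before using the results on the CPU you would synchronize both queues, for example:

// wait for the work queued on both devices to finish
for (int d=0; d < 2; d++) {
    magma_setdevice(d);
    magma_queue_sync( queue[d] );
}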

Stan

aran nokan

Sep 28, 2021, 3:02:01 PM
to Stanimire Tomov, MAGMA User
Awesome, thanks!

Do we have any option to make data movement between the 2 GPUs faster? I don't have an NVLink connection, just PCI Express.

I am using "magma_dcopymatrix_async" and I am seeing that the data first goes to the host and then from the host to the other device.

Aran

Stanimire Tomov

Sep 28, 2021, 3:33:04 PM
to aran nokan, Stanimire Tomov, MAGMA User
Yes, you can copy from one GPU to the CPU and from the CPU to the other GPU.
NVIDIA also provides peer-to-peer (P2P) copies, which may be a more efficient implementation.

Also look at the example in samples/p2pBandwidthLatencyTest.

It first checks whether the hardware can directly access peer devices. If not, the calls fall back to the normal memcpy procedure.
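
For reference, a minimal sketch of the underlying CUDA runtime calls (my example, with device 0, device 1, and a stream on device 0 assumed; not a MAGMA API):

#include <cuda_runtime.h>

// Sketch: enable peer access between GPU 0 and GPU 1 if the hardware allows
// it, then copy a buffer directly from GPU 1 to GPU 0. Without peer access
// the same copy call is staged through host memory by the driver.
void copy_gpu1_to_gpu0( void *dst_on_gpu0, const void *src_on_gpu1,
                        size_t bytes, cudaStream_t stream )
{
    int can_access = 0;
    cudaDeviceCanAccessPeer( &can_access, 0, 1 );   // can device 0 access device 1?
    if (can_access) {
        cudaSetDevice( 0 );
        cudaDeviceEnablePeerAccess( 1, 0 );         // enable once per device pair
    }
    // direct GPU-to-GPU transfer when peer access is enabled,
    // otherwise falls back to staging through the host
    cudaMemcpyPeerAsync( dst_on_gpu0, 0, src_on_gpu1, 1, bytes, stream );
}
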
Here, for example, is the output that I get on a DGX box with 8 A100 GPUs
(you can run it on your system to see what to expect):

-bash-4.2$ ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-SXM-80GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-SXM-80GB, pciBusID: f, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-SXM-80GB, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-SXM-80GB, pciBusID: 4e, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-SXM-80GB, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-SXM-80GB, pciBusID: 90, pciDeviceID: 0, pciDomainID:0
Device: 6, A100-SXM-80GB, pciBusID: b7, pciDeviceID: 0, pciDomainID:0
Device: 7, A100-SXM-80GB, pciBusID: bd, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1
     2     1     1     1     1     1     1     1     1
     3     1     1     1     1     1     1     1     1
     4     1     1     1     1     1     1     1     1
     5     1     1     1     1     1     1     1     1
     6     1     1     1     1     1     1     1     1
     7     1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1530.36  15.31  17.26  16.72  17.42  18.01  17.51  17.29 
     1  17.07 1537.89  17.80  16.30  17.34  17.96  17.48  17.31 
     2  17.30  17.28 1534.87  14.74  17.44  17.84  17.81  17.31 
     3  17.87  17.45  16.74 1564.06  17.46  17.96  17.86  17.01 
     4  17.98  17.67  17.83  16.77 1570.35  16.14  17.13  16.74 
     5  18.04  17.26  18.07  16.65  16.29 1401.35  17.10  16.75 
     6  17.90  17.61  18.18  16.66  17.87  17.45 1565.63  13.83 
     7  17.85  17.31  18.05  17.03  17.84  17.44  16.33 1568.78 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1537.89 270.59 272.78 274.94 274.29 273.81 275.01 274.38 
     1 270.84 1547.03 270.07 274.46 274.38 273.48 274.45 273.68 
     2 270.91 272.77 1543.97 274.32 274.40 274.07 274.18 274.06 
     3 272.90 274.47 273.07 1591.14 273.75 273.74 274.76 273.78 
     4 271.64 273.92 273.63 275.52 1583.08 274.03 274.80 273.96 
     5 272.29 274.11 273.85 273.20 273.65 1579.88 272.52 274.70 
     6 272.51 274.07 274.23 273.61 274.06 274.91 1578.28 274.47 
     7 273.57 272.20 272.57 273.86 275.42 274.71 274.11 1587.91 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1545.50  15.28  25.56  23.90  24.22  24.13  23.62  19.20 
     1  17.11 1582.28  18.86  18.88  23.97  24.08  23.49  21.80 
     2  24.82  19.31 1587.10  17.78  25.12  25.64  24.99  22.62 
     3  23.95  18.91  17.39 1583.08  23.12  23.33  23.13  21.83 
     4  24.35  24.13  24.45  23.12 1586.29  18.72  25.35  24.72 
     5  24.85  23.78  24.23  23.23  18.38 1586.29  25.31  24.54 
     6  25.18  24.43  24.51  23.20  25.77  25.38 1589.52  13.08 
     7  19.58  24.74  25.65  23.81  22.53  22.44  12.33 1485.27 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1548.56 426.36 426.36 427.06 427.76 426.94 427.99 427.99 
     1 427.88 1547.03 428.11 428.11 427.88 428.11 428.11 426.01 
     2 428.08 428.82 1545.50 426.71 428.82 429.52 429.05 428.58 
     3 427.06 428.46 428.70 1543.97 428.46 428.77 427.99 428.23 
     4 429.08 430.85 431.67 430.33 1591.14 519.32 518.81 516.75 
     5 429.11 430.86 431.89 430.76 518.98 1586.29 516.41 517.95 
     6 429.36 430.77 430.55 431.77 517.09 516.41 1581.48 517.78 
     7 430.02 430.63 430.31 429.70 517.95 518.97 516.58 1579.08 
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   3.08  24.67  24.96  25.37  24.61  24.72  24.63  24.73 
     1  24.69   2.90  24.98  25.34  24.64  24.60  24.62  24.63 
     2  25.41  25.05   2.95  24.66  25.69  24.63  24.63  24.65 
     3  25.45  25.03  24.69   2.95  25.59  24.65  24.62  24.61 
     4  25.34  25.33  24.92  25.29   2.95  24.71  24.79  24.77 
     5  25.64  25.67  24.60  25.54  24.59   2.64  24.50  24.60 
     6  24.69  24.98  24.71  24.81  24.63  24.61   2.64  24.63 
     7  25.66  25.47  25.48  25.60  24.51  24.59  24.58   2.64 

   CPU     0      1      2      3      4      5      6      7 
     0   4.74  15.10  15.00  14.97  13.12  13.01  13.71  12.96 
     1  14.96   4.60  14.88  14.85  12.95  12.86  13.58  12.95 
     2  14.91  14.75   4.65  14.86  12.99  12.92  13.64  12.93 
     3  14.93  14.77  14.80   4.62  12.99  12.97  13.75  12.92 
     4  13.62  13.47  13.57  13.56   4.03  11.72  12.43  11.63 
     5  13.62  13.50  13.57  13.53  11.62   4.02  12.49  11.60 
     6  14.04  13.99  14.07  14.03  12.10  12.09   4.24  12.09 
     7  13.73  13.54  13.56  13.54  11.61  11.64  12.32   4.00 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   3.08   3.35   3.37   3.43   3.47   3.36   3.36   3.43 
     1   3.40   2.89   3.35   3.39   3.42   3.41   3.35   3.41 
     2   3.43   3.35   2.94   3.40   3.37   3.39   3.39   3.36 
     3   3.42   3.45   3.35   2.93   3.34   3.34   3.36   3.35 
     4   3.52   3.48   3.46   3.52   2.93   3.45   3.53   3.52 
     5   3.51   3.52   3.55   3.45   3.51   2.93   3.51   3.45 
     6   3.01   2.93   2.93   2.93   2.92   2.92   2.64   2.98 
     7   3.04   3.14   3.04   3.10   3.03   3.10   3.14   2.64 

   CPU     0      1      2      3      4      5      6      7 
     0   4.79   4.04   3.56   3.42   4.26   4.15   4.05   4.08 
     1   4.19   4.79   4.05   4.42   4.09   4.09   4.07   4.09 
     2   4.16   4.08   4.80   4.12   4.09   4.07   4.08   4.07 
     3   4.18   4.08   4.08   4.89   4.13   4.15   3.69   3.61 
     4   4.22   4.21   4.10   4.15   4.88   4.06   4.08   4.16 
     5   3.56   3.53   3.56   3.52   3.51   4.23   3.48   3.49 
     6   3.76   3.66   3.69   3.83   3.78   3.76   4.44   3.89 
     7   4.28   4.25   3.62   3.59   3.53   3.52   3.59   4.18 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. 