Yes, you can copy from one GPU to the CPU and then from the CPU to the other GPU.
NVIDIA also provides direct peer-to-peer (P2P) copies (e.g. cudaMemcpyPeer), which may be a more efficient implementation:
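The staged route can be sketched as follows (a minimal example; the buffer names and the 1 MiB size are illustrative, and error checking is elided):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    size_t bytes = 1 << 20;
    float *d0, *d1, *h = (float *)malloc(bytes);

    // One buffer on each GPU.
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // GPU0 -> host, then host -> GPU1. No peer access required,
    // but the data crosses the PCIe/host bridge twice.
    cudaMemcpy(h, d0, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(d1, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaFree(d0); cudaFree(d1); free(h);
    return 0;
}
```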
See also the example in samples/p2pBandwidthLatencyTest.
It first checks whether the hardware supports direct peer access. If not, the calls fall back to the normal memcpy procedure (staged through host memory).
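That check-and-enable pattern can be sketched like this (assumes two GPUs with device IDs 0 and 1; error handling elided):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the driver whether each device can access the other's memory directly.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    size_t bytes = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    if (can01 && can10) {
        // Enable direct access in both directions; copies then go over
        // NVLink/PCIe without staging through host memory.
        cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    }

    // cudaMemcpyPeer works either way: it takes the direct path when peer
    // access is enabled, and falls back to a staged copy otherwise.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaDeviceSynchronize();

    printf("copy done (P2P %s)\n", (can01 && can10) ? "enabled" : "disabled");
    cudaFree(d1);
    cudaSetDevice(0); cudaFree(d0);
    return 0;
}
```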
Here is example output I get on a DGX box with 8 A100 GPUs
(you can run it on your system to see what to expect):
-bash-4.2$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-SXM-80GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-SXM-80GB, pciBusID: f, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-SXM-80GB, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-SXM-80GB, pciBusID: 4e, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-SXM-80GB, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-SXM-80GB, pciBusID: 90, pciDeviceID: 0, pciDomainID:0
Device: 6, A100-SXM-80GB, pciBusID: b7, pciDeviceID: 0, pciDomainID:0
Device: 7, A100-SXM-80GB, pciBusID: bd, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1530.36 15.31 17.26 16.72 17.42 18.01 17.51 17.29
1 17.07 1537.89 17.80 16.30 17.34 17.96 17.48 17.31
2 17.30 17.28 1534.87 14.74 17.44 17.84 17.81 17.31
3 17.87 17.45 16.74 1564.06 17.46 17.96 17.86 17.01
4 17.98 17.67 17.83 16.77 1570.35 16.14 17.13 16.74
5 18.04 17.26 18.07 16.65 16.29 1401.35 17.10 16.75
6 17.90 17.61 18.18 16.66 17.87 17.45 1565.63 13.83
7 17.85 17.31 18.05 17.03 17.84 17.44 16.33 1568.78
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1537.89 270.59 272.78 274.94 274.29 273.81 275.01 274.38
1 270.84 1547.03 270.07 274.46 274.38 273.48 274.45 273.68
2 270.91 272.77 1543.97 274.32 274.40 274.07 274.18 274.06
3 272.90 274.47 273.07 1591.14 273.75 273.74 274.76 273.78
4 271.64 273.92 273.63 275.52 1583.08 274.03 274.80 273.96
5 272.29 274.11 273.85 273.20 273.65 1579.88 272.52 274.70
6 272.51 274.07 274.23 273.61 274.06 274.91 1578.28 274.47
7 273.57 272.20 272.57 273.86 275.42 274.71 274.11 1587.91
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1545.50 15.28 25.56 23.90 24.22 24.13 23.62 19.20
1 17.11 1582.28 18.86 18.88 23.97 24.08 23.49 21.80
2 24.82 19.31 1587.10 17.78 25.12 25.64 24.99 22.62
3 23.95 18.91 17.39 1583.08 23.12 23.33 23.13 21.83
4 24.35 24.13 24.45 23.12 1586.29 18.72 25.35 24.72
5 24.85 23.78 24.23 23.23 18.38 1586.29 25.31 24.54
6 25.18 24.43 24.51 23.20 25.77 25.38 1589.52 13.08
7 19.58 24.74 25.65 23.81 22.53 22.44 12.33 1485.27
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1548.56 426.36 426.36 427.06 427.76 426.94 427.99 427.99
1 427.88 1547.03 428.11 428.11 427.88 428.11 428.11 426.01
2 428.08 428.82 1545.50 426.71 428.82 429.52 429.05 428.58
3 427.06 428.46 428.70 1543.97 428.46 428.77 427.99 428.23
4 429.08 430.85 431.67 430.33 1591.14 519.32 518.81 516.75
5 429.11 430.86 431.89 430.76 518.98 1586.29 516.41 517.95
6 429.36 430.77 430.55 431.77 517.09 516.41 1581.48 517.78
7 430.02 430.63 430.31 429.70 517.95 518.97 516.58 1579.08
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 3.08 24.67 24.96 25.37 24.61 24.72 24.63 24.73
1 24.69 2.90 24.98 25.34 24.64 24.60 24.62 24.63
2 25.41 25.05 2.95 24.66 25.69 24.63 24.63 24.65
3 25.45 25.03 24.69 2.95 25.59 24.65 24.62 24.61
4 25.34 25.33 24.92 25.29 2.95 24.71 24.79 24.77
5 25.64 25.67 24.60 25.54 24.59 2.64 24.50 24.60
6 24.69 24.98 24.71 24.81 24.63 24.61 2.64 24.63
7 25.66 25.47 25.48 25.60 24.51 24.59 24.58 2.64
CPU 0 1 2 3 4 5 6 7
0 4.74 15.10 15.00 14.97 13.12 13.01 13.71 12.96
1 14.96 4.60 14.88 14.85 12.95 12.86 13.58 12.95
2 14.91 14.75 4.65 14.86 12.99 12.92 13.64 12.93
3 14.93 14.77 14.80 4.62 12.99 12.97 13.75 12.92
4 13.62 13.47 13.57 13.56 4.03 11.72 12.43 11.63
5 13.62 13.50 13.57 13.53 11.62 4.02 12.49 11.60
6 14.04 13.99 14.07 14.03 12.10 12.09 4.24 12.09
7 13.73 13.54 13.56 13.54 11.61 11.64 12.32 4.00
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 3.08 3.35 3.37 3.43 3.47 3.36 3.36 3.43
1 3.40 2.89 3.35 3.39 3.42 3.41 3.35 3.41
2 3.43 3.35 2.94 3.40 3.37 3.39 3.39 3.36
3 3.42 3.45 3.35 2.93 3.34 3.34 3.36 3.35
4 3.52 3.48 3.46 3.52 2.93 3.45 3.53 3.52
5 3.51 3.52 3.55 3.45 3.51 2.93 3.51 3.45
6 3.01 2.93 2.93 2.93 2.92 2.92 2.64 2.98
7 3.04 3.14 3.04 3.10 3.03 3.10 3.14 2.64
CPU 0 1 2 3 4 5 6 7
0 4.79 4.04 3.56 3.42 4.26 4.15 4.05 4.08
1 4.19 4.79 4.05 4.42 4.09 4.09 4.07 4.09
2 4.16 4.08 4.80 4.12 4.09 4.07 4.08 4.07
3 4.18 4.08 4.08 4.89 4.13 4.15 3.69 3.61
4 4.22 4.21 4.10 4.15 4.88 4.06 4.08 4.16
5 3.56 3.53 3.56 3.52 3.51 4.23 3.48 3.49
6 3.76 3.66 3.69 3.83 3.78 3.76 4.44 3.89
7 4.28 4.25 3.62 3.59 3.53 3.52 3.59 4.18
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.