Zero Copy vs Managed Memory in Tegra K/X

Carlos Haas

Nov 17, 2015, 7:42:01 PM

to ArrayFire Users

I saw your GTC 2015 presentation, "Application Development for Mobile Devices - Tegra K1", and the two slides you didn't explain are the ones I'm most interested in :)

The difference between zero copy (ZC) and managed memory is clear to me on a traditional PCIe architecture. Managed memory keeps duplicate copies of the data (one on the device, one on the host) and handles the PCIe transfers transparently.

But on the Tegra K1 there should be no need for duplicated memory content, since Tegra provides unified physical memory.

So do you know how managed memory works on Tegra?

Assuming there is no duplicate copy on Tegra, using managed memory instead of ZC would let one avoid the cudaHostGetDevicePointer() calls.

Any thoughts?

Shehzan Mohammed

Nov 18, 2015, 12:20:05 PM

to ArrayFire Users

From what we have seen, the pointer returned by cudaHostAlloc and the one returned by cudaHostGetDevicePointer have the same value on the Tegra K1 when using zero copy.

So technically, you could use the same pointer directly in your kernels. However, from a coding standpoint, I would recommend using the cudaHostGetDevicePointer API to keep the code clean, readable and portable. I have not analysed the performance implications of this, but since it's probably just copying the pointer, it should not affect performance in any meaningful way.
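To make the zero copy pattern concrete, here is a minimal sketch of what is being described. It is not one of the attached files; the kernel `scale_by_two` and the function `run_zero_copy_demo` are hypothetical names made up for illustration. It prints both pointers so you can check on your own board whether they match.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void scale_by_two(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns the number of incorrect elements,
// or -1 if a CUDA call fails. Assumes n > 0.
int run_zero_copy_demo(int n) {
    // Request support for mapped pinned allocations. The result is ignored
    // (and the error state cleared) because this call can fail harmlessly
    // when a context is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    // Pinned host allocation that is also mapped into the device address space.
    float *h_ptr = nullptr;
    if (cudaHostAlloc((void **)&h_ptr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i) h_ptr[i] = (float)i;

    // Portable way to obtain the device-side alias of the host pointer.
    float *d_ptr = nullptr;
    if (cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0) != cudaSuccess) {
        cudaFreeHost(h_ptr);
        return -1;
    }
    printf("host pointer %p, device pointer %p\n", (void *)h_ptr, (void *)d_ptr);

    // The kernel works directly on host memory; there is no cudaMemcpy.
    scale_by_two<<<(n + 255) / 256, 256>>>(d_ptr, n);
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_ptr);
        return -1;
    }

    // Read the results back through the host pointer.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_ptr[i] != 2.0f * (float)i) ++bad;

    cudaFreeHost(h_ptr);
    return bad;
}
```

Calling `run_zero_copy_demo(1024)` from `main` should return 0 on a working setup. Passing `h_ptr` to the kernel instead of `d_ptr` is the shortcut mentioned above; it relies on the two values being equal, which is why I would still go through cudaHostGetDevicePointer.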

As for Unified Memory, I'll need to dig into that a bit. I'm away at SC15 so can't do much right now. You can expect to hear back from me by Monday.

-Shehzan

Carlos Haas

Nov 23, 2015, 11:36:35 AM

to ArrayFire Users

Thank you, Shehzan. OK, got it for zero copy.

When you have more details about managed memory and Unified Memory on Tegra K/X, I would appreciate it if you could share them.

Regards,

Carlos

Shehzan Mohammed

Nov 23, 2015, 7:06:34 PM

to ArrayFire Users

TL;DR version: Unified Memory is not working on the Tegra K1. On the Tegra X1, Unified Memory behaves the same as zero copy.

This inference is from nvprof/nvvp: neither the unified nor the zero copy execution shows any copy calls.

NOTE: However, this must be taken with a pinch of salt, as the "unified memory profiling" option for nvprof is unsupported on Tegra devices. I cannot confirm with 100% certainty that no copy is happening, but from the execution patterns it looks likely that there is none. (For comparison, the x86 profile shows D->H and H->D copies for the unified API when viewed in NVVP.)
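For anyone who wants to repeat the comparison without the attachments, the managed memory variant looks roughly like the sketch below. This is not the attached unified source file; `add_one` and `run_managed_demo` are hypothetical names used only for illustration. The point is that a single pointer from cudaMallocManaged is used on both host and device, with no cudaHostGetDevicePointer call and no explicit cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds one to each element in place.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns the number of incorrect elements,
// or -1 if a CUDA call fails. Assumes n > 0.
int run_managed_demo(int n) {
    // One allocation, one pointer, valid on both host and device.
    float *data = nullptr;
    if (cudaMallocManaged((void **)&data, n * sizeof(float)) != cudaSuccess)
        return -1;

    // Initialise from the host through the managed pointer.
    for (int i = 0; i < n; ++i) data[i] = (float)i;

    // Use the very same pointer in the kernel launch.
    add_one<<<(n + 255) / 256, 256>>>(data, n);

    // Synchronise before the host touches the data again. On devices with
    // compute capability below 6.0 the host must not access managed memory
    // while a kernel is running.
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    // Verify the results from the host.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != (float)i + 1.0f) ++bad;

    cudaFree(data);
    return bad;
}
```

Calling `run_managed_demo(1024)` from `main` should return 0 where Unified Memory works. A return value of -1 only tells you that some CUDA call failed, so print the cudaError_t values if you want to see which one.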

I have attached three CUDA source files (standard, unified, zero copy) along with six nvprof files, three for the TX1 and three for x86. These can be opened in NVVP if you wish to dig further.

The compilation command I used was

/usr/local/cuda/bin/nvcc file.cu -ccbin /usr/bin/cc -gencode arch=compute_XY,code=sm_XY -I/usr/local/cuda/include -o file

Here, file is the filename and XY is the compute capability of your GPU: 32 for the K1, 53 for the X1.

Lastly, I'm not sure why the unified memory file is not working on the Tegra K1. I've tried to investigate it and have already put more time into debugging it than I would have liked. If you wish to investigate it, let me know what comes up.