Zero Copy on Tegra K1

Shehzan, ArrayFire


Zero Copy access has been part of the CUDA Toolkit for a long time (since around 2009). However, very few applications used the capability because, over time, GPU memory has grown reasonably large. The applications that did use zero copy access were mainly ones with extremely large memory requirements, such as database processing.

Zero Copy is a way to map host memory and access it directly over PCIe without doing an explicit memory transfer. It allows CUDA kernels to access host memory directly. Instead of reading data from global memory (limited to ~200 GB/s), data is read over PCIe and limited by PCIe bandwidth (up to ~16 GB/s). Hence there was no real performance advantage for most applications.

However, with the release of the Jetson TK1 board with the Tegra K1 processor, zero copy has become really useful. The Jetson board has 2 GB of physical memory that is shared by the ARM CPU and the CUDA GPU. A host-to-device cudaMemcpy on the Jetson therefore just copies the data to a new location in the same physical memory and returns a CUDA pointer to it. In such a scenario, Zero Copy access is perfect.

Standard CUDA Pipeline

Let's take a look at a standard pipeline in CUDA.
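In code, the standard pipeline looks roughly like this (a sketch; my_kernel, grid, and block are placeholders, and error checking is omitted):

```cuda
// Standard pipeline: pageable host allocations + explicit cudaMemcpy transfers.
size_t bytes = 4096 * 4096 * sizeof(float);

float *h_in  = (float *)malloc(bytes);
float *h_out = (float *)malloc(bytes);
// ... fill h_in ...

float *d_in, *d_out;
cudaMalloc(&d_in,  bytes);
cudaMalloc(&d_out, bytes);

cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> device copy
my_kernel<<<grid, block>>>(d_out, d_in);                 // kernel works on device pointers
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // device -> host copy

cudaFree(d_in);  cudaFree(d_out);
free(h_in);      free(h_out);
```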

Zero Copy Access CUDA Pipeline

Now let's look at a pipeline using Zero Copy access.
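The same pipeline with zero copy access looks roughly like this (again a sketch with a placeholder my_kernel; note that cudaSetDeviceFlags must be called before the CUDA context is created):

```cuda
// Zero copy pipeline: mapped, pinned host allocations; no cudaMemcpy at all.
cudaSetDeviceFlags(cudaDeviceMapHost);       // must precede any other CUDA call

size_t bytes = 4096 * 4096 * sizeof(float);

float *h_in, *h_out;
cudaHostAlloc(&h_in,  bytes, cudaHostAllocMapped);
cudaHostAlloc(&h_out, bytes, cudaHostAllocMapped);
// ... fill h_in ...

float *d_in, *d_out;
cudaHostGetDevicePointer(&d_in,  h_in,  0);  // device-side alias of h_in
cudaHostGetDevicePointer(&d_out, h_out, 0);  // device-side alias of h_out

my_kernel<<<grid, block>>>(d_out, d_in);     // kernel code is unchanged
cudaDeviceSynchronize();                     // results are now visible in h_out

cudaFreeHost(h_in);  cudaFreeHost(h_out);
```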

As you can see, zero copy access is considerably simpler to use on the Tegra from a coding perspective. The kernel code remains the same. You can also allocate host memory using cudaHostAlloc without using zero copy access; this still gives you fast pinned-memory transfers when calling cudaMemcpy.
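For that pinned-but-not-mapped variant, only the host allocation changes relative to the standard pipeline (a sketch):

```cuda
float *h_in;
cudaHostAlloc(&h_in, bytes, cudaHostAllocDefault);     // pinned, but not mapped
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice); // faster than a pageable copy
```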


So how does it perform?

When I ran a transpose kernel using zero copy versus the standard pipeline, I achieved the following results for a 4096 x 4096 matrix.

Pipeline             Bandwidth (GB/s)   Time (ms)
Standard Pipeline    3.0                45
Zero Copy Pipeline   5.8                23

The bandwidth of a device-to-device copy kernel on the Tegra is ~6.6 GB/s.

The results are tremendous. Any result that is as good as or better than the standard pipeline is worth getting via zero copy, since it also saves memory by not requiring the data to be duplicated on host and device.

There are a few caveats, though. If you are running multiple kernels over the data without modifying it on the host side, it may be wise to use the standard pipeline. There is no single right answer. Since it is so simple to switch the code between memory transfers and zero copy pointers, it is best to run both techniques and benchmark them.


With these results, we believe the Tegra K1 is a great option for streaming applications. Most streaming applications run image or signal processing algorithms with a constraint of running at either 60 or 30 frames per second. On desktops, although the kernels themselves run well under the required time, the memory transfers take up a significant portion of it.

This is why the Tegra K1 can be great. By using Zero Copy access, we save 100% of the memory transfer time incurred by discrete GPUs. This allows the Tegra K1 GPU to perform streaming operations within the performance constraints even though it is considerably under-powered compared to most desktop GPUs.

Of course, streaming is not the only application. There are many more, and if you do happen to try your application on the K1, we would be extremely excited to hear about it.


You can find the entire code from my transpose exercise here: transpose.
Note: If you wish to run this on an x86 system, make sure you change the compute version in the makefile from 32 to 30 (for Kepler).
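For reference, a tiled transpose kernel along these lines is usually written as below. This is a sketch of the standard shared-memory technique, not necessarily the exact kernel in the linked code:

```cuda
#define TILE 32

// Tiled transpose: each block stages a TILE x TILE tile in shared memory
// so that both the global read and the global write are coalesced.
__global__ void transpose(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    // swap block offsets for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```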