Introduction
Zero Copy access has been part of the CUDA Toolkit for a long time (since around 2009). However, very few applications used the capability because GPU memory has grown reasonably large over time. The applications that did use zero copy access were mainly ones with extremely large memory requirements, such as database processing.
Zero Copy is a way to map host memory into the device's address space and access it directly over PCIe, without an explicit memory transfer. It allows CUDA kernels to access host memory directly. Instead of reading data from global memory (~200 GB/s), the data is read over PCIe and is limited by PCIe bandwidth (up to 16 GB/s). Hence there was no real performance advantage for most applications.
However, with the release of the Jetson TK1 board with the Tegra K1 processor, this has become really useful. The Jetson board has 2 GB of physical memory that is shared by the ARM CPU and the CUDA GPU. If a host-to-device cudaMemcpy is done on the Jetson, the data is simply copied to a new location on the same physical memory and a CUDA pointer to it is returned. In such a scenario, Zero Copy access is perfect.
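As a sanity check before relying on zero copy on any system, you can query whether the device supports mapping host memory. A minimal sketch using the standard runtime API:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // canMapHostMemory is non-zero if host allocations can be
    // mapped into the device address space (required for zero copy)
    if (prop.canMapHostMemory)
        printf("%s supports mapped host memory\n", prop.name);
    else
        printf("%s cannot use zero copy access\n", prop.name);

    return 0;
}
```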
Standard CUDA Pipeline
Let's take a look at a standard pipeline in CUDA.
```cpp
// Host arrays
float* h_in  = new float[sizeIn];
float* h_out = new float[sizeOut];

// Process h_in

// Device arrays
float *d_out, *d_in;

// Allocate memory on the device
cudaMalloc((void **) &d_in,  sizeIn  * sizeof(float));
cudaMalloc((void **) &d_out, sizeOut * sizeof(float));

// Copy the input array from the host (CPU) to the device (GPU)
cudaMemcpy(d_in, h_in, sizeIn * sizeof(float), cudaMemcpyHostToDevice);

// Launch the GPU kernel
kernel<<<grid, block>>>(d_out, d_in);

// Copy the result back
cudaMemcpy(h_out, d_out, sizeOut * sizeof(float), cudaMemcpyDeviceToHost);

// Continue processing on the host using h_out
```
Zero Copy Access CUDA Pipeline
Now let's look at a pipeline using Zero Copy access.
```cpp
// Set the flag to enable zero copy access
// (must be done before any other CUDA calls)
cudaSetDeviceFlags(cudaDeviceMapHost);

// Host arrays
float* h_in  = NULL;
float* h_out = NULL;

// Allocate mapped host memory using the CUDA allocation calls
cudaHostAlloc((void **)&h_in,  sizeIn  * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, sizeOut * sizeof(float), cudaHostAllocMapped);

// Process h_in

// Device arrays
float *d_out, *d_in;

// Get device pointers to the host memory. No allocation or memcpy
cudaHostGetDevicePointer((void **)&d_in,  (void *) h_in,  0);
cudaHostGetDevicePointer((void **)&d_out, (void *) h_out, 0);

// Launch the GPU kernel
kernel<<<grid, block>>>(d_out, d_in);

// No need to copy d_out back

// Continue processing on the host using h_out
```
As you can see, zero copy access is considerably simpler to use on the Tegra from a coding perspective, and the kernel code remains the same. You can also allocate host memory using cudaHostAlloc without using zero copy access; this still gives you the fast pinned-memory transfers when doing a cudaMemcpy.
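For illustration, a naive transpose kernel of the kind that works unchanged with either pipeline might look like the sketch below (the actual kernel in the transpose code linked at the end is likely tiled and differs from this):

```cpp
// Naive transpose: each thread reads one element and writes it to
// the transposed position. The kernel does not know or care whether
// d_in and d_out came from cudaMalloc or cudaHostGetDevicePointer.
__global__ void transpose(float* d_out, const float* d_in,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        d_out[x * height + y] = d_in[y * width + x];
}
```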
Results
So how does it perform?
When I ran a transpose kernel using zero copy versus the standard pipeline, I achieved the following results for a 4096 x 4096 matrix.
| Pipeline           | Bandwidth (GB/s) | Time (ms) |
|--------------------|------------------|-----------|
| Standard Pipeline  | 3.0              | 45        |
| Zero Copy Pipeline | 5.8              | 23        |
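As a sanity check on these figures: a transpose moves each element twice (one read, one write), so the total traffic is 2 × 4096 × 4096 × 4 bytes ≈ 134 MB, which works out to ≈ 3.0 GB/s over 45 ms and ≈ 5.8 GB/s over 23 ms, matching the table.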
The bandwidth of a device-to-device copy kernel on the Tegra is ~6.6 GB/s.
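The article does not show that copy kernel, but a minimal version is just an element-wise copy, counted with the same two-transfers-per-element convention:

```cpp
// Minimal device-to-device copy kernel, usable as a bandwidth
// baseline: bytes moved = 2 * n * sizeof(float) (one read, one write)
__global__ void copyKernel(float* d_out, const float* d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_in[i];
}
```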
The results are tremendous. Any result that is as good as or better than the standard pipeline makes zero copy worth using, since it also saves memory: the data is not duplicated on the host and the device.
There are a few caveats, though. If you are running multiple kernels over the data without modifying it on the host side, the standard pipeline may be the wiser choice. There is no universally right answer. Since it is so simple to switch the code between memory transfers and zero copy pointers, it is best to implement both techniques and benchmark them, as in the sketch below.
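For instance, a simple way to benchmark the two variants is to time the complete pipeline (any transfers plus the kernel) with CUDA events; runPipeline here is a placeholder for either version:

```cpp
#include <cuda_runtime.h>

// Time a complete pipeline (transfers + kernel) in milliseconds.
// runPipeline stands in for either the standard or the zero copy variant.
float timePipelineMs(void (*runPipeline)())
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    runPipeline();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the GPU work finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```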
Applications
With these results, we believe the Tegra K1 is a great option for streaming applications. Most streaming applications run image or signal processing algorithms, with the constraint of running at either 60 or 30 frames per second. On desktops, although the kernels themselves run well under the required time, the memory transfer times form a significant portion of the total.
This is why the Tegra K1 can be great. By using Zero Copy access, we save 100% of the memory transfer time that discrete GPUs incur. This allows the Tegra K1 GPU to perform streaming operations within the performance constraints even though it is considerably under-powered compared to most desktop GPUs.
Of course, streaming is not the only application. There are many more, and if you do happen to try your application on the K1, we would be extremely excited to hear about it.
Code
You can find the entire code from my transpose exercise here: transpose.
Note: If you wish to run this on an x86 system, make sure you change the compute version in the makefile from 32 to 30 (for Kepler).
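The exact variable names in that makefile may differ, but the relevant nvcc flags look roughly like this (NVCCFLAGS is a placeholder name):

```make
# Tegra K1 (compute capability 3.2)
NVCCFLAGS += -gencode arch=compute_32,code=sm_32

# For a desktop Kepler GPU (compute capability 3.0) use instead:
# NVCCFLAGS += -gencode arch=compute_30,code=sm_30
```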
Comments
How would we tell ArrayFire to use cudaHostGetDevicePointer instead of a memcpy on the TK1?

array a(10, 10, h_data)

Is there another argument we can give to avoid the memcpy?
Unfortunately, we have not built this capability into ArrayFire, since we did a direct port of ArrayFire rather than taking advantage of this feature.
The first step would be to allocate the host data using the array::pinned function (http://www.arrayfire.com/docs/group__array__func__cpp__pinned.htm). This function calls cudaHostAlloc internally.
You can then fetch the device pointer using cudaHostGetDevicePointer as shown in the code. Then pass the device pointer to array a(size, dev_ptr, af::device);.
The problem here is that there is no way to stop ArrayFire from deleting the device memory when the array goes out of scope. This may lead to errors, since CUDA does not like the mapped device pointer being freed with cudaFree.
To free the host memory, use the array::free function as shown in the documentation for array::pinned.
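Putting those steps together, a rough and untested sketch (allocating with cudaHostAlloc directly here, since array::pinned wraps it; size, h_data, and dev_ptr are placeholder names):

```cpp
float* h_data = NULL;

// Pinned, mapped host allocation (array::pinned calls cudaHostAlloc
// internally; calling it directly here for illustration)
cudaHostAlloc((void **)&h_data, size * sizeof(float), cudaHostAllocMapped);

// ... fill h_data on the host ...

// Fetch the device pointer mapped onto the host allocation
float* dev_ptr = NULL;
cudaHostGetDevicePointer((void **)&dev_ptr, (void *)h_data, 0);

// Wrap the device pointer in an ArrayFire array, as suggested above
array a(size, dev_ptr, af::device);
```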
I haven’t tested this out. But if you would like to, let me know how it goes.
First, thank you for this nice article. My question is about the CPU transpose time. Indeed, with the normal transpose (transpose), I got the following times:
******************************************
***CPU Transpose***
Elapsed Time for 10 runs = 6351.26ms
Bandwidth (GB/s) = 0.211324
******************************************
And with the zero copy version I got this:
******************************************
***CPU Transpose***
Elapsed Time for 10 runs = 28057.9ms
Bandwidth (GB/s) = 0.0478359
******************************************
How can we explain this big difference in run time for the CPU transpose, despite the fact that both are running on the same processor?
Hello! Thank you for the great article!
I am trying to replace the standard allocation and transfers with the zero copy allocation in a project where I read my arrays in a C file and accelerate one function on the GPU (CUDA C file), but I am facing two problems:
a) cudaHostAlloc((void **)&h_in, sizeIn, cudaHostAllocMapped) is not supported in a C file, which is why I cannot use it to replace the initial allocation of the data (malloc).
b) If I use cudaHostAlloc((void **)&h_in, sizeIn, cudaHostAllocMapped) in a .cu file, accesses to the data in h_in (stored in zero copy memory) are much slower than if h_in had been allocated with a traditional malloc.
So to use zero copy memory transfers, I have to do an extra allocation (cudaHostAlloc) and an extra copy (copying the data from the traditionally allocated buffer into the zero copy buffer).
Can I somehow avoid the extra allocations and transfers?
Thank you in advance!