How long does it take to get 98X performance improvement with GPUs?

John Melonakos · Case Studies

Well, here is a recent story about one of our customers who achieved a 98X speedup with Jacket in 16 days.  Of those 16 days, 15 were spent sending emails back and forth about the problem, and less than a day was spent getting the customer's code running in Jacket and collecting performance numbers!  Who would have imagined GPU programming with real performance gains in a single day?  Happy reading.

Day 1:

Customer uses inverse radon (or iradon in MATLAB terms) extensively for their back projection algorithms.  They would like to know when the iradon function will be available/supported in Jacket.

AccelerEyes product management informs the customer that the inverse radon implementation in MATLAB is based on the filtered back projection algorithm.  The key elements of that algorithm are interpolation, FFT, and frequency-domain filtering, all of which are supported in Jacket and should provide decent speedups for large enough data sizes.
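For readers less familiar with the algorithm, here is a minimal MATLAB sketch (ours, not the customer's code) of how those pieces fit together; the ramp filter construction and scaling here are simplified compared to what iradon actually builds internally:

theta = 0:179;
R = radon(phantom(128), theta);                         % forward projection: sinogram, one column per angle
n = size(R, 1);
k = (0:n-1)';
H = 2 * min(k, n - k) / n;                              % simplified |f| ramp filter, in FFT ordering
Rf = real(ifft(fft(R) .* repmat(H, 1, numel(theta))));  % frequency-domain filtering
I = iradon(Rf, theta, 'linear', 'none');                % interpolation + back projection, built-in filtering disabled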

Day 4:

The customer replies to product management by email, indicating a few things:

  1. They would be interested in working with AccelerEyes on an initial Jacket implementation of inverse radon, given the information provided in the Day 1 response.
  2. If the initial implementation demonstrates substantial savings in execution time, the customer can allocate the appropriate internal resources to leverage Jacket and GPUs for their needs.
  3. They have some (very limited) experience with programming GPUs directly (we assume CUDA) and had determined that the approach would require far too much development effort from a relatively small staff of software developers.
  4. The customer is ready to work on a prototype/pilot to get some results.

Day 17:

6:49am: After 13 days, the customer sends a follow-up email to product management and the corporate email box asking whether or not AccelerEyes is still interested in working with the customer on the inverse radon prototype/pilot.

Turns out the spam filters are not always your best friend: the customer's email from Day 4 got caught in the corporate spam filter, and the product manager never got the message.  The moral of the story is to follow up email with a phone call in time-sensitive situations!

11:31am: After profuse apologies from product management regarding the delay, an exchange of information about the customer code begins.  A review of the use case for the function is conducted (use case = the MATLAB code developed by the customer for the application).  The MATLAB code for the iradon function itself, along with the MATLAB built-in functions it calls, is also reviewed with the customer to understand all aspects of the application/algorithm.

2:31pm: It becomes clear what needs to be done with Jacket and how parts of the customer’s MATLAB code and other m-files need to be prepared for Jacket.

For all Jacket users: many of you have asked for various MATLAB functions to be implemented in Jacket to ease the programming effort of leveraging GPUs, to increase performance on more applications, and in general to achieve more results.  The outline below of what needed to be done to get iradon working for this customer was not a design goal and is NOT officially supported for all Jacket users, but the same approach can be used by many Jacket customers to achieve results today, without waiting for a specific function to show up in the Jacket documentation as fully supported.

The approach:

Step 1: Find the function in the MATLAB source tree that is bottlenecking your performance; in this case, iradon.m.
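If it is not obvious which function that is, MATLAB's built-in profiler is a quick way to find out (the driver function below is a hypothetical stand-in for your own top-level script):

profile on
run_reconstruction();   % hypothetical driver that exercises the algorithm
profile viewer          % the bottleneck, iradon.m here, should dominate the report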

Step 2: Save the function's .m file under a new GPU-specific name such as foo.m; in this case the function was saved as "giradon.m".

Step 3: In your foo.m function (in this case giradon.m), pass the input variables from your application into the new function you are creating (for example, function [img,H] = iradon(varargin) becomes function [img,H] = giradon(varargin)).

Step 4: Where applicable, add a "g" prefix to zeros when allocating memory, which will allocate the memory on the GPU for execution.
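As a sketch, the before/after for a typical allocation looks like this (img and N are placeholders, not lines from the customer's code):

% Before: zeros allocates in host (CPU) memory
img = zeros(N);
% After: Jacket's gzeros allocates the same array in GPU memory
img = gzeros(N);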

Step 5: Assign the variables that will be used during the most computationally intensive calls within the m-file to be GPU variables, e.g. p=gsingle(p);.  It is only necessary to make GPU variables of the data that goes to the GPU and that is used by a supported Jacket function, fft for example.  In this particular case the following code was added to the new giradon.m file to make it Jacket-compatible for the 'linear' case.

%% Change 1: Add below line 158
%% Converting to gsingle values for GPU usage:
costheta = gsingle(costheta);
sintheta = gsingle(sintheta);
p = gsingle(p);
x = gsingle(x);
y = gsingle(y);
ctrIdx = gsingle(ctrIdx);

Step 6: Vectorize code where possible, especially where for-loops are involved.  In the iradon case the customer vectorized the for-loop in the function filterProjections() as shown below:

%% Change 2: Lines 204-206
%% The following 'for' loop was unrolled using vectorization:
% for i = 1:size(p,2)
%    p(:,i) = p(:,i).*H; % frequency domain filtering
% end
%% After vectorizing the above for-loop:
p(:,1:size(p,2)) = p(:,1:size(p,2)).*repmat(H,1,size(p,2));
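As a side note, the repmat temporary can also be avoided with MATLAB's bsxfun, which applies the same element-wise multiply with implicit expansion.  Whether bsxfun is supported and faster under your Jacket version is something to benchmark rather than assume:

% H is n-by-1 and p is n-by-nAngles; bsxfun expands H across the columns of p
p = bsxfun(@times, p, H);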

Step 7: Open the MATLAB code that calls the bottleneck function.  You have two options here: you can change the call from iradon.m (in this example) to foo.m or giradon.m so that your MATLAB code always uses the GPU, or you can set a flag that asks the user whether or not they have a GPU.

If no, execution follows the normal path of the code using the standard m-files.

If yes, execution follows a path in which the iradon.m call (from this example) is replaced with giradon.m or foo.m.

Once you have modified your MATLAB application, save and run the code with the GPU option on.
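A minimal sketch of that flag-based dispatch might look like the following (useGPU and the call site are hypothetical):

useGPU = true;               % set by the user, a config file, or hardware detection
if useGPU
    I = giradon(R, 0:179);   % modified, Jacket-enabled path
else
    I = iradon(R, 0:179);    % stock MATLAB path
end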

3:54pm: Customer needs to test and benchmark the changes to the code and chooses the following data sizes for initial testing, while a determination is made of the most typical parameters used in reconstructions:

P = phantom(128);
R = radon(P,0:179);
I1 = iradon(R,0:179);
I2 = iradon(R,0:179,'linear','none');
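A simple harness along the following lines (illustrative, not the customer's exact script) produces the kind of runtime and error numbers reported in the tables below.  Keep in mind that Jacket evaluates lazily, so warm-up runs and synchronization can affect GPU timings:

tic; I_cpu = iradon(R, 0:179);  t_cpu = toc;
tic; I_gpu = giradon(R, 0:179); t_gpu = toc;
% double() brings the GPU result back to the host for comparison
err = norm(double(I_cpu(:)) - double(I_gpu(:)), inf);   % max absolute difference
fprintf('CPU %.2f s, GPU %.2f s, %.2fX, err = %g\n', t_cpu, t_gpu, t_cpu/t_gpu, err);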

Day 18:

10:17am: Customer determines that the most typical parameters would be to increase the phantom size from P = phantom(128) to 1024 x 1024 ( P = phantom(1024) ).  It turned out the number of angles used is 180, so the initial use case of 0:179 was in line with the requirement.

5:55pm: The following performance results are achieved with Jacket on GPUs using two different hardware configurations and various data sizes.

System 1:

CPU: Intel Xeon dual-core 2.8 GHz, 1GB RAM (2 cores)

GPU: GeForce GTX 295, 1212 MHz, 895 MB VRAM, Compute 1.3

OS: Windows XP (32-bit)

MATLAB Version: R2009b

Jacket Version: 1.2.2

Phantom Dim   Projections Dim   'iradon' runtime (sec)   'giradon' runtime (sec)   Speedup (GPU vs CPU)   Error Norm (inf)
128×128       185×180           0.64                     0.98                      0.65                   1.77E-006
256×256       367×180           2.45                     0.91                      2.69                   3.35E-006
512×512       729×180           10.39                    0.97                      10.71                  6.82E-006
1024×1024     1453×180          41.46                    1.15                      36.05                  1.84E-005
2048×2048     2901×180          167.07                   1.69                      98.86                  3.12E-005

System 2:

CPU: Dual quad-core Intel Xeon W5580 @ 3.20 GHz, 32GB RAM (8 cores)

GPU: Tesla C1060, 1265 MHz, 4095 MB VRAM, Compute 1.3

OS: Linux Fedora 10 (32-bit)

MATLAB Version: R2009b

Jacket Version: 1.2.2

Phantom Dim   Projections Dim   'iradon' runtime (sec)   'giradon' runtime (sec)   Speedup (GPU vs CPU)   Error Norm (inf)
128×128       185×180           0.1                      0.4                       0.25                   1.77E-006
256×256       367×180           0.38                     0.41                      0.93                   3.35E-006
512×512       729×180           1.82                     0.43                      4.23                   6.82E-006
1024×1024     1453×180          8.35                     0.51                      16.37                  1.84E-005
2048×2048     2901×180          38.14                    1.93                      19.76                  3.12E-005

The "iradon" runtimes were measured on the CPU, whereas the "giradon" runtimes leveraged the GPU.  Of course, the title of this blog post is a little misleading: the 98X speedup did not hold for every data size and every configuration, as performance always varies with code, data size, hardware, and other factors.  In this case, an everyday desktop with a GeForce GPU can provide almost 100X performance improvement over a garden-variety dual-core Intel Xeon 2.8 GHz processor for the right-sized data problem.  But even when you run on state-of-the-art hardware, both CPU and GPU, and the data size is appropriate for the GPU architecture, it is not unusual to get a 10 to 20X performance improvement.

There is actually a bigger story here than the performance improvement and speedup of this code: productivity.  This customer had already determined that they did not have the resources, knowledge, and expertise to program CUDA to leverage GPU resources.  Jacket provides a platform for engineers, scientists, and analysts who are familiar with MATLAB to quickly (in this case, a couple of hours) make existing applications and algorithms work on GPUs and achieve substantial performance gains, even when the latest and greatest hardware is not immediately at one's disposal.  Shortening the time to solution is one of Jacket's major value points and should be considered even when programming in CUDA is an option.

PS: one more point to the story.  Email is great, but today's spam filters can really disrupt communication if alternative methods are not used alongside our new way of life.
