MATLAB® parallel for loops (parfor) allow the body of a for loop to be executed across multiple workers simultaneously, but with some significant restrictions. With Jacket MGL, Jacket can be used within parfor loops, subject to the same restrictions. However, it is important to note that Jacket MGL does not currently support co-distributed arrays.
Problem Size
Problem size might be the single most important consideration in parallelization using the Parallel Computing Toolbox (PCT) and Jacket MGL. When data is used by a worker in the MATLAB pool it must be copied from MATLAB to the worker, and must be copied back when the computation is complete. Additionally, when GPU data is used, it must then be copied by the worker to the GPU and back. With small, simple problems, parallelization may not offer a performance improvement because of the overhead of this data movement. Unfortunately, there is no simple answer to the question “What size problem justifies multi-GPU parallelization?” We have found that experimentation provides the best answer.
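As a rough sketch of such an experiment (the sizes, the 4-worker pool, and the reduction used here are illustrative assumptions, not recommendations):

matlabpool open 4                      % start 4 PCT workers

for n = [128 512 2048]                 % try several problem sizes
    A = rand(n, n, 8);
    r = zeros(1, 8);
    tic
    parfor k = 1:8
        B = gsingle(A(:,:,k));         % host -> worker -> GPU copy
        C = fft2(B);                   % computed on the worker's GPU
        r(k) = double(max(abs(C(:)))); % GPU -> worker -> host copy
    end
    fprintf('n = %4d: %.3f s\n', n, toc);
end

matlabpool close

Plotting or tabulating the timings against a serial baseline makes the break-even size for your hardware obvious.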
Array Slicing
One of the ways to reduce the overhead of parallelization is to minimize the amount of data being copied. This can be done through array slicing. When an array is indexed with a variable in only a single dimension, the array can be “sliced” so that ONLY the relevant data is copied. For example, a 128×128 array has 16,384 entries. If a worker referenced both dimensions of the array with variables, the entire array would have to be copied. When only a single dimension is indexed with a variable, MATLAB only needs to copy the elements that the worker could possibly refer to. In the example here, only 128 elements would need to be copied instead of 16,384.
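For illustration, here is a minimal sketch of the difference (the array names and sizes are ours):

A = rand(128);                 % 128 x 128 = 16,384 elements
s = zeros(128, 1);
parfor i = 1:128
    s(i) = sum(A(i,:));        % A is sliced: only row i (128 elements)
end                            % is copied to the worker

B = rand(128);
t = zeros(128, 1);
parfor i = 1:128
    t(i) = B(i,:) * B(:,i);    % B is indexed two different ways, so it is
end                            % broadcast: all 16,384 elements are copied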
for/gfor/parfor in parfor
It’s possible to use the various types of for loops embedded inside a parfor loop, but they cannot appear directly in the body of the loop. Instead, they must be inside a function called from the loop. Here is a simple example:
parfor i=1:8
    for j=1:8
        m(i,j) = i*j;
    end
end
The above example fails because of transparency: apart from the parfor loop variable itself, MATLAB must be able to evaluate all indices at the time the code is read, not at the time the code is executed. We can fix this example by changing it to:
function x = f(n, i)
    x = n;                     % operate on a local copy of the row
    gfor j = 1:8
        x(j) = i*j;            % GFOR computes all j iterations on the GPU at once
    gend
end

m = gones(8);
parfor i = 1:8
    m(i,:) = f(m(i,:), i);     % each worker fills one row through the function
end

Because the GFOR now lives inside f, the parfor body contains only an ordinary function call, which satisfies MATLAB’s transparency rules.
parfor in a gfor
It is not possible to use a parfor loop or an spmd command inside of a GFOR.
Variable Classification
Inside of a parallel loop, MATLAB attempts to classify each variable. The different classes are:
– Temporary variable: a variable created inside a single iteration whose lifespan is only that iteration
– Reduction variable: a variable which accumulates its value across all iterations, but whose final value can be computed regardless of the order in which the iterations execute
– Broadcast variable: a variable declared outside of the loop and referenced in the loop, but whose value is not changed within the loop
– Sliced variable: a variable whose elements are operated on by independent iterations of the loop
– Loop variable: the iterator of the loop
If a variable cannot be classified into one of these categories, the loop will not be able to execute. Further restrictions on the ways variables may be classified and used are explained in the Advanced Parfor Topics on The MathWorks web site at:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/brdqtjj-1.html
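As an illustrative sketch, a single loop can exhibit every one of these classes (variable names are ours):

n = 8;                        % broadcast: read inside the loop, never written
A = rand(n);                  % sliced input: indexed by the loop variable
out = zeros(n, 1);            % sliced output
total = 0;                    % reduction: accumulated across all iterations

parfor i = 1:n                % i is the loop variable
    tmp = sum(A(i,:));        % temporary: created and used within one iteration
    out(i) = tmp / n;
    total = total + tmp;      % order-independent, so MATLAB accepts it
end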
Understanding Some of the Restrictions
The following constructs are not compatible with parallelized for loops:
– Loops may not contain spmd statements.
– Loops may not contain global or persistent variable declarations.
– Loops may not contain break or return statements.
For more information about the MATLAB-imposed restrictions on parallel for loops, visit:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/bq__cs7-1.html
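For example, a search loop that would normally exit early with break can often be restructured to record a per-iteration flag and inspect it after the loop. In this sketch, some_test is a hypothetical predicate, and note that this approach does more work than break would in a serial loop:

found = false(1, 8);
parfor i = 1:8
    found(i) = some_test(i);   % hypothetical per-iteration check
end
hit = find(found, 1);          % first matching index, determined serially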
Comments (3)
How does CUDA/Jacket handle scheduling if I have, say, 8 CPUs all doing CUDA FFTs within a parfor loop? I’ve tried this before (without Jacket) but the computer locked up (I have a separate video card).
Is this kind of functionality supported and/or stable?
Let me explain a little about the ideal configuration first. If you want to use Jacket within a parfor loop, you ought to have a number of GPUs equal to the number of workers you have started in PCT. So, for example, if you have a 4-CPU machine configured to use a MATLAB pool of 4 workers, you would want to have 4 GPUs.

If you have fewer GPUs than CPUs, our recommendation is to use fewer workers. If you use more workers than GPUs, you will get into a situation called thrashing, where the different workers compete with each other for access to the GPUs. The end result is poor performance because the GPUs spend too much time flipping from task to task rather than staying focused.

So, let’s assume you have an optimal configuration with “n” CPUs and a matching number of GPUs. In this case, with Jacket MGL, each of the workers would have its own GPU. Your parfor loop will then split the iterations across the CPU/GPU pairs, and each worker would accomplish 1/n of the iterations. Assuming you use the Jacket gsingle or gdouble data types, your FFTs would then be offloaded by the workers to the GPU for computation.

With small matrices, your FFTs may not see much speedup, but with large matrices we’ve seen extremely significant performance improvements.
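As a sketch of that optimal configuration (the gselect call, its 1-based device numbering, and the assumption that the selection persists on each worker through the subsequent parfor are all assumptions about Jacket MGL; adjust for your installation):

matlabpool open 4                 % one PCT worker per GPU

spmd
    gselect(labindex);            % assumed Jacket device-selection call;
end                               % binds each worker to its own GPU

data = rand(1024, 1024, 32);
results = zeros(1024, 1024, 32);
parfor k = 1:32
    X = gsingle(data(:,:,k));               % slice copied to this worker's GPU
    results(:,:,k) = double(abs(fft2(X)));  % FFT on the GPU, result pulled back
end

matlabpool close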
parfor i=1:8
    for j=1:8
        m(i,j) = i*j;
    end
end

This code works on my desktop computer with MATLAB 2011a.