Using Parallel For Loops (parfor) with MATLAB® and Jacket


MATLAB® parallel for loops (parfor) allow the body of a for loop to be executed across multiple workers simultaneously, though with some significant restrictions. With Jacket MGL, Jacket can be used within parfor loops, subject to those same restrictions. However, it is important to note that Jacket MGL does not currently support codistributed arrays.

Problem Size

Problem size may be the single most important consideration in parallelization using the Parallel Computing Toolbox (PCT) and Jacket MGL. When data is used by a worker in the MATLAB pool, it must be copied from MATLAB to the worker and copied back when the computation is complete. Additionally, when GPU data is used, it must then be copied by the worker to the GPU and back. For small, simple problems, parallelization may not offer a performance improvement because of the overhead of this data movement. Unfortunately, there is no simple answer to the question of “What size problem justifies multi-GPU parallelization?” We have found that experimentation provides the best answer.
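To experiment, a quick timing sweep can help locate the break-even point on a given machine. Here is a minimal sketch, assuming a 4-worker pool; the matrix sizes, pool size, and iteration count are arbitrary placeholders, not recommendations:

matlabpool open 4
for n = [64 256 1024 4096]
  A = rand(n);
  tic;
  s = zeros(1, 8);
  parfor i = 1:8
    G = gsingle(A);                       % copied to the worker, then to the GPU
    s(i) = sum(abs(double(fft(G(:,1))))); % compute on the GPU, copy result back
  end
  fprintf('n = %4d: %.3f seconds\n', n, toc);
end
matlabpool close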

Array Slicing

One of the ways to reduce the overhead of parallelization is to minimize the amount of data being copied. This can be done through array slicing. When an array is indexed with a variable in only a single dimension, the array can be “sliced” so that ONLY the relevant data is copied. For example, a 128×128 array has 16,384 entries. If the array were used by a worker with both dimensions indexed by variables, the entire array would have to be copied. When only a single dimension is indexed with a variable, MATLAB only needs to copy the elements that the worker could possibly refer to. In the example here, only 128 elements would need to be copied instead of 16,384.
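As a sketch of the difference, the loops below use the 128×128 array from the example (variable names are illustrative). The first reference is sliced; the second, because its row index is computed inside the loop, forces the entire array to be broadcast to every worker:

A = rand(128);            % 16,384 elements
B = zeros(128);
parfor i = 1:128
  B(i,:) = 2 * A(i,:);    % sliced: only row i (128 elements) is copied
end

C = zeros(128);
parfor i = 1:128
  j = mod(i, 128) + 1;    % index computed at run time
  C(i,:) = 2 * A(j,:);    % broadcast: all 16,384 elements go to each worker
end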

for/gfor/parfor in parfor

It’s possible to use the various types of for loops inside a parfor loop, but they cannot appear directly in the body of the loop. Instead, they must be inside a function called by the loop. Here is a simple example:

parfor i=1:8
  for j=1:8
    m(i,j) = i*j;
  end
end

The above example fails because of how MATLAB classifies variables: other than the parfor loop’s own iterator, MATLAB must be able to determine how indices are used at the time the code is parsed, not at the time the code is executed. We can fix this example by moving the inner loop into a function:

function m = f(n, i)
  m = n;           % work on a local copy of the incoming slice
  gfor j = 1:8     % gfor executes all 8 iterations on the GPU at once
    m(j) = i*j;
  gend
end

m = gones(8);
parfor i=1:8
  m(i,:) = f(m(i,:), i);
end

parfor in a gfor

It is not possible to use a parfor loop or an spmd command inside of a GFOR.

Variable Classification

Inside a parallel loop, MATLAB attempts to classify each variable. The different classes are:

– Temporary variable:  a variable created inside a single iteration whose lifespan is only that iteration

– Reduction variable: a variable which accumulates its value across all iterations, but whose value may be computed regardless of the order of execution of the iterations

– Broadcast variable: a variable declared outside of the loop and referenced in the loop, but whose value is not changed within the loop

– Sliced variable: a variable whose elements are operated on by independent iterations of the loop

– Loop variable: the iterator of the loop
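As a brief sketch, here is a single loop containing one variable of each class (the names and values are illustrative):

total = 0;                % reduction variable
scale = 3;                % broadcast variable: read, never written
x = zeros(1, 100);        % sliced variable
parfor i = 1:100          % i is the loop variable
  t = scale * i;          % temporary variable, local to each iteration
  x(i) = t;               % sliced assignment
  total = total + t;      % reduction: result is independent of iteration order
end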

If a variable cannot be classified into one of these categories, the loop will not be able to execute.  Further restrictions on the ways variables may be classified and used are explained in the Advanced Parfor Topics on The MathWorks web site at:

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/brdqtjj-1.html

Understanding Some of the Restrictions

Things that are not compatible with parallelization of for loops:

– Loops may not contain spmd statements.

– Loops may not contain global or persistent variable declarations.

– Loops may not contain break or return statements.
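For example, the first loop below is rejected because of the break statement; the second is a sketch of a common workaround that records a per-iteration flag and tests it after the loop:

parfor i = 1:10
  if i > 5
    break;                % error: break is not allowed inside parfor
  end
end

found = false(1, 10);
parfor i = 1:10
  found(i) = (i > 5);     % record the condition instead of breaking
end
any(found)                % test after the loop completes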

For more information about the MATLAB-imposed restrictions on parallel for loops, visit:

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/bq__cs7-1.html

Comments

  1. How does CUDA/Jacket handle scheduling if I have, say, 8 CPUs all doing CUDA FFTs within a parfor loop? I’ve tried this before (without Jacket) but the computer locked up (I have a separate video card).

    Is this kind of functionality supported and/or stable?

    1. Let me explain a little about the ideal configuration first. If you want to use Jacket within a parfor loop, you ought to have a number of GPUs equal to the number of workers you have started in PCT. So, for example, if you have a 4-CPU machine and have it configured to use a MATLAB pool of 4 workers, you would want to have 4 GPUs.

      If you have fewer GPUs than CPUs, our recommendation is to use fewer workers. If you use more workers than GPUs, you will get into a situation called thrashing, where the different workers compete with each other for access to the GPUs. The end result is poor performance because the GPUs spend too much time flipping from task to task rather than staying focused.

      So, let’s assume you have an optimal configuration with “n” CPUs and a matching number of GPUs. In this case, with Jacket MGL, each of the workers would have its own GPU. Your parfor loop will then split the iterations across each of the CPU/GPU pairs, and each worker would accomplish 1/n of the iterations. Assuming you use the Jacket gsingle or gdouble data types, your FFTs would then be offloaded by the workers to the GPU for computation.

      With small matrices, your FFTs may not see much speedup, but if you are using large matrices, we’ve seen extremely significant performance improvements.
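      A minimal sketch of that setup, assuming Jacket MGL handles the worker-to-GPU assignment as described above (the pool size and matrix dimensions here are placeholders):

      matlabpool open 4                  % one worker per GPU
      results = cell(1, 4);
      parfor k = 1:4
        A = gsingle(rand(2048));         % each worker pushes its data to its GPU
        results{k} = double(fft2(A));    % FFT runs on the GPU; result copied back
      end
      matlabpool close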

  2. parfor i=1:8
       for j=1:8
         m(i,j) = i*j;
       end
     end
     This code works on my desktop computer with MATLAB 2011a.
