One Jacket programmer recently emailed the following to us:
Our chief scientist asked me a question that I’d like to pass on to you. I think I know the answer, but you guys can be much more definitive than I can.
He recently read about people achieving ~10x speedups by converting parts of their code to MEX files. He was wondering how much of the observed speedup is due to MEX and how much is due to CUDA and the GPU.
Two Questions You Should Ask Yourself
When contemplating an effort to optimize a piece of code, it is important to break the effort into two separate questions. Both need to be addressed to improve performance:
- How well-written is the code, given the language, hardware, and software libraries used?
- Is there a better platform (the combination of language, hardware, and software libraries) for the problem?
Understanding the Two Questions
Holding the underlying language, hardware, and software libraries constant, you can often attain speedups simply by writing code that is better tuned to the environment. For instance, MATLAB code almost always benefits from vectorization, that is, replacing FOR-loops with operations on whole arrays. MathWorks recommends vectorization, and we provide examples in our documentation too. Researchers at Northeastern University saw 100-fold speedups through vectorization efforts alone.
It is impossible to say in general how much performance improvement better code will deliver. The worse the original code, the more you stand to gain from this effort.
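As a rough illustration of the vectorization point above, the sketch below computes the same quantity (a sum of squared differences) once with a FOR-loop and once with whole-array operations. The problem, the variable names, and the array size are our own assumptions chosen for illustration; they do not come from the Northeastern work or from the documentation mentioned above. The timings depend heavily on the MATLAB version and the machine, so treat the measured ratio as indicative only.

```matlab
% Hypothetical example: sum of squared differences between two vectors.
n = 1e6;                 % assumed problem size, chosen only for illustration
x = rand(n, 1);
y = rand(n, 1);

% Loop version: one element per iteration.
tic;
s_loop = 0;
for k = 1:n
    d = x(k) - y(k);
    s_loop = s_loop + d * d;
end
t_loop = toc;

% Vectorized version: whole-array operations, no explicit loop.
tic;
d = x - y;
s_vec = sum(d .^ 2);
t_vec = toc;

% Both versions compute the same value up to floating-point rounding.
fprintf('loop: %.4f s, vectorized: %.4f s\n', t_loop, t_vec);
```

The two results can differ in the last few digits because the additions happen in a different order, so compare them with a tolerance rather than with `==`.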
Changing the Platform
Often, the underlying language, hardware, or software libraries cannot deliver the performance you need, no matter how well-written the code may be. In this case, you will want to search for a better platform. For instance, GPUs excel at exploiting data-parallelism. As a rule of thumb, if you are working with matrices or vectors containing over 10,000 elements, a GPU will probably help.
Of course, if you decide that there is a better language/HW/SW platform, you will need to revisit Question #1 to ensure that your code is well-written for the new platform.
It is a happy synergy that both CPU-based MATLAB and GPU-based Jacket work better on vectorized code, so any code optimization made for one platform benefits the other.
Should I rewrite my M functions as MEX functions?
Now, given the framework of these two questions, we see that the “M functions vs. MEX functions” inquiry boils down to whether it makes sense to change the platform. In the search for faster MATLAB code, there are two options you might consider, which differ greatly in difficulty:
Platform | Difficulty | CPU Optimizations | Data-Parallel Optimizations |
C/CUDA | Harder | Rewrite FOR-loops in C/MEX. | Exploit parallelism with CUDA. Integrate into MATLAB with the Jacket SDK and GFOR. |
MATLAB | Easier | Vectorize code. | Exploit parallelism with Jacket data types and GFOR. |
As mentioned previously, any change of platform requires revisiting Question #1 to ensure the code is well-written for that platform. Keep in mind that writing efficient CUDA code is hard.
Conclusions
In the end, the speedups achieved from an optimization effort depend on how well the code is written for the underlying platform and on how well that underlying platform is able to exploit parallelism in the problem.
Comments 1
I’ve found that I can sometimes get speedups of 2-4x by using MATLAB’s emlmex command to turn my MATLAB code directly into a MEX file. Basically, emlmex takes your M-file and compiles it into a MEX file, so the loops speed up a lot.
I emailed MathWorks about this once. They said it does not always produce a speedup, but it can have very substantial effects sometimes, and it’s pretty easy to use: there’s no need to know how to write C code. Sometimes you just can’t avoid a loop.