Optimization methods for deep learning


Researchers at SAIL (the Stanford Artificial Intelligence Laboratory) have done it again. They have successfully used Jacket to speed up the training phase of deep learning algorithms. In their paper titled “On Optimization Methods for Deep Learning”, they experiment with several well-known training algorithms and demonstrate their scalability across parallel architectures (GPUs as well as multi-machine networks). The algorithms include SGD (Stochastic Gradient Descent), L-BFGS (Limited-memory BFGS, used for solving non-linear optimization problems), and CG (Conjugate Gradient).
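As a rough illustration of what SGD's manual tuning involves (this is not the paper's code; the learning rate and momentum values below are placeholders), a basic mini-batch SGD update looks like:

```python
import numpy as np

def sgd_step(w, grad_fn, X_batch, y_batch, lr=0.01, momentum=0.9, v=None):
    """One mini-batch SGD update with momentum.

    grad_fn(w, X, y) -> gradient of the loss on the batch.
    lr and momentum are the hyperparameters that typically need
    manual tuning; the defaults here are illustrative only.
    """
    v = np.zeros_like(w) if v is None else v
    g = grad_fn(w, X_batch, y_batch)
    v = momentum * v - lr * g   # velocity update
    return w + v, v             # new parameters and new velocity
```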

While SGD is easy to implement, it requires manual tuning of its hyperparameters, and its sequential nature makes it hard to scale and parallelize, which limits its usefulness for deep learning. L-BFGS and CG can be harder to implement and are more computationally intensive per iteration. For speed, L-BFGS uses approximate second-order information, and CG uses conjugacy information during the optimization step. To overcome the scalability problems of L-BFGS and CG, which normally require the gradient over the entire data set, the authors use mini-batch training, which results in a faster algorithm for larger data sizes.
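The paper's implementation builds on Jacket; purely as a language-agnostic sketch of the mini-batch idea, the loop below runs an off-the-shelf L-BFGS solver on successive mini-batches rather than on the full data set (the function names, batch size, and iteration counts are assumptions, not values from the paper):

```python
import numpy as np
from scipy.optimize import minimize

def minibatch_lbfgs(w0, loss_and_grad, X, y, batch_size=1000,
                    epochs=5, iters_per_batch=20, seed=0):
    """Run L-BFGS on successive mini-batches instead of the full data set.

    loss_and_grad(w, Xb, yb) must return (loss, gradient) on a batch;
    the defaults here are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    n = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            res = minimize(loss_and_grad, w,
                           args=(X[batch], y[batch]),
                           jac=True, method="L-BFGS-B",
                           options={"maxiter": iters_per_batch})
            w = res.x
    return w
```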

Following are a few of the results they achieved using these algorithms on various problems. Each machine is equipped with 4 Intel CPU cores (at 2.67 GHz) and a GeForce GTX 285 GPU.

Autoencoder training on the CPU

Autoencoder training on the GPU vs CPU

It can be seen that L-BFGS and CG converge nearly twice as fast on the GPU as on the CPU, while SGD neither improves nor degrades. This behavior is explained by the much larger mini-batch sizes preferred by L-BFGS and CG compared to SGD, which makes their gradient computations easier to parallelize on the GPU.

For the next experiment, they used the algorithms to train supervised convolutional neural networks, splitting the workload of computing the gradient across multiple GPU-equipped machines.

Training Supervised CNNs on a cluster of computers (with GPUs)

As can be seen, the rate of convergence for L-BFGS scales well all the way up to 8 machines!
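The gradient split they describe is, at its core, a map-reduce over shards of the data: each machine computes the gradient on its shard, and the partial gradients are summed. A minimal single-process sketch (the sharding and worker count here are illustrative; the paper distributes the work over real machines with GPUs):

```python
import numpy as np

def distributed_gradient(w, grad_fn, X, y, n_workers=8):
    """Data-parallel gradient: shard the mini-batch, compute partial
    gradients, then sum them. Each shard stands in for one machine/GPU;
    here the 'workers' are simulated by a plain loop.
    """
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    partials = [grad_fn(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # Reduce step: if the loss is a sum over examples, the gradient is
    # additive over disjoint shards of the batch.
    return np.sum(partials, axis=0)
```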

The paper also discusses other optimization techniques, the use of these algorithms for sparse autoencoders and locally connected networks, and the utility and accuracy of L-BFGS for deep learning.

We would like to thank Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng, and everyone in their group for letting us use and share their work.
