Action Recognition with Independent Subspace Analysis

Researchers at the Stanford Artificial Intelligence Laboratory (SAIL), building on their previous work, have once again used Jacket to speed up their algorithm.

In a paper at CVPR 2011 entitled “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis”, they explain how their unsupervised feature learning algorithm competes with other approaches that use hand-crafted or learned features.

Method	KTH	Hollywood2	UCF	YouTube
Best published results	92.1%	50.9%	85.6%	71.2%
Stanford group results	93.9%	53.3%	86.5%	75.8%

Tested on four well-known benchmark datasets, their algorithm outperformed the best previously published results on every one.

For training, they used a multi-layered stacked convolutional ISA (independent subspace analysis) network. ISA learns features from image patches without supervision.

The architecture of an Independent Subspace Analysis network
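To make the idea concrete, here is a minimal sketch of how an ISA network responds to a single patch, assuming the common two-layer formulation: simple units that square a linear filter response, followed by pooling units that take the square root of the sum over a fixed group (a subspace). The function name `isa_response` and the toy weights are illustrative assumptions, not the authors' code, and the weights here are hand-picked rather than learned.

```python
import math

def isa_response(x, W, V):
    """Return pooled ISA activations for one flattened patch x.

    W: first-layer weights, one row per simple unit.
    V: fixed pooling matrix, one row per pooling unit (0/1 entries
       marking which simple units belong to that subspace).
    """
    # First layer: linear filter response, then squaring.
    squared = [sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W]
    # Second layer: sum the squared responses in each subspace, then square root.
    return [math.sqrt(sum(v * s for v, s in zip(row, squared))) for row in V]

# Toy example (hypothetical weights): 4 simple units on a 2-pixel patch,
# grouped into 2 subspaces of 2 units each.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
V = [[1, 1, 0, 0], [0, 0, 1, 1]]
x = [3.0, 4.0]

responses = isa_response(x, W, V)
flipped = isa_response([-xi for xi in x], W, V)
```

Because each pooling unit only sees squared responses, flipping the sign of the patch leaves `responses` unchanged, which is one simple kind of invariance that this pooling provides.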

The standard ISA algorithm, however, becomes computationally inefficient as the size of the image patches is scaled up. To overcome this problem, they developed a convolutional neural network that uses PCA and ISA at alternating levels. The ISA learned at a particular level is convolved over a larger region of the image. The results of this convolution step are fed into a PCA layer for pre-processing before being passed on to the next ISA layer.

Architecture of a Stacked Convolutional ISA network
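The stacking step can be sketched roughly as follows. This is a simplified illustration under several assumptions: it works on 1-D signals for brevity, all weights are hand-picked rather than learned, and the PCA step is represented by a pre-computed projection matrix `P`. The helper names (`isa_response`, `project`, `stacked_features`) are hypothetical.

```python
import math

def isa_response(x, W, V):
    """Square linear filter responses, pool per subspace, take the square root."""
    squared = [sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W]
    return [math.sqrt(sum(v * s for v, s in zip(row, squared))) for row in V]

def project(x, P):
    """Apply a linear projection (stand-in for a pre-computed PCA matrix)."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]

def stacked_features(big_patch, sub_size, stride, W1, V1, P, W2, V2):
    """Run a small first-level ISA over a larger patch, then PCA, then a second ISA."""
    layer1 = []
    # Convolution step: apply the first-level ISA to each sub-patch.
    for start in range(0, len(big_patch) - sub_size + 1, stride):
        sub = big_patch[start:start + sub_size]
        layer1.extend(isa_response(sub, W1, V1))
    # PCA pre-processing reduces the concatenated first-level responses.
    reduced = project(layer1, P)
    # The second-level ISA operates on the reduced responses.
    return isa_response(reduced, W2, V2)

# Toy example (hypothetical values): a 4-sample patch split into two 2-sample sub-patches.
W1, V1 = [[1.0, 0.0], [0.0, 1.0]], [[1, 1]]
P = [[0.6, 0.8]]
W2, V2 = [[1.0]], [[1]]

features = stacked_features([3.0, 4.0, 6.0, 8.0], 2, 2, W1, V1, P, W2, V2)
```

The point of the design is that the first-level ISA only ever sees small inputs, so the expensive learning step stays cheap even when the overall region grows.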

To learn spatio-temporal features from video signals, they applied the same model to 3D video blocks rather than 2D image patches.

Modified architecture for 3D video blocks
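As a rough illustration of what a 3D video block means as input, the sketch below cuts a video (a list of frames) into blocks that span several consecutive frames and flattens each block into a vector. The function `extract_video_blocks`, its parameters, and the toy video are assumptions made for this example; the block sizes and strides the authors used are not given here.

```python
def extract_video_blocks(video, size, depth, stride, t_stride):
    """Cut a video into flattened spatio-temporal blocks.

    video: list of frames, each frame a list of rows of pixel values.
    size: spatial side length of a block; depth: number of frames per block.
    stride / t_stride: spatial and temporal step between blocks.
    """
    n_frames, height, width = len(video), len(video[0]), len(video[0][0])
    blocks = []
    for t in range(0, n_frames - depth + 1, t_stride):
        for r in range(0, height - size + 1, stride):
            for c in range(0, width - size + 1, stride):
                # Flatten in (frame, row, column) order.
                block = [video[t + dt][r + dr][c + dc]
                         for dt in range(depth)
                         for dr in range(size)
                         for dc in range(size)]
                blocks.append(block)
    return blocks

# Toy video: 3 frames of 4x4 pixels, each value encoding its (t, r, c) position.
video = [[[t * 100 + r * 10 + c for c in range(4)] for r in range(4)]
         for t in range(3)]

blocks = extract_video_blocks(video, size=2, depth=2, stride=2, t_stride=1)
```

Each flattened block can then be treated exactly like a flattened image patch, which is why the same model carries over from images to video.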

The trained networks appear to have learned features that are robust to translation, but sensitive to frequency and rotation (at the first level). The features learned at the second level appear to represent more complex shapes such as corners.

Features learnt from the Hollywood2 database at first and second layers

As mentioned earlier, their algorithm not only beats the best published results in accuracy, but is also generally faster in both training and testing. The feature extraction speeds of their algorithm are given below.

Algorithm	Frames per Second	Speedup
HOG3D	4.54	1.0
Stacked ISA (layer 1 only)	7.14	1.6
Stacked ISA (layers 1 and 2)	2.27	0.5
Stacked ISA with Jacket (layers 1 and 2)	10	2.2

As you can see, running their algorithm (dominated by matrix-vector and convolution operations) with Jacket gave a 4.4X speedup over the CPU implementation of the same two-layer network!
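For readers who want to check the arithmetic, here is a small sketch that recomputes the speedups from the frames-per-second figures in the table. The variable and function names are just for illustration.

```python
# Frames per second, copied from the table above.
fps = {
    "HOG3D": 4.54,
    "Stacked ISA (layer 1 only)": 7.14,
    "Stacked ISA (layers 1 and 2)": 2.27,
    "Stacked ISA with Jacket (layers 1 and 2)": 10.0,
}

def speedup(fps_new, fps_base):
    """Speedup is the ratio of throughputs."""
    return fps_new / fps_base

# The table's Speedup column is measured relative to HOG3D.
relative_to_hog3d = {name: round(speedup(value, fps["HOG3D"]), 1)
                     for name, value in fps.items()}

# The 4.4X figure compares Jacket against the CPU version of the same network.
jacket_vs_cpu = round(
    speedup(fps["Stacked ISA with Jacket (layers 1 and 2)"],
            fps["Stacked ISA (layers 1 and 2)"]), 1)
```

Note the two different baselines: 2.2X is relative to HOG3D, while 4.4X is relative to the two-layer stacked ISA running on the CPU.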

Special thanks to Quoc V. Le, Will Y. Zou, Serena Y. Yeung and Andrew Y. Ng from SAIL for sharing their research. We have more success stories from their group. Keep an eye out for more blog posts!
