Deep Learning with Kernel Regularization for Visual Recognition

Kai Yu    Wei Xu    Yihong Gong
NEC Laboratories America, Cupertino, CA 95014, USA
{kyu, wx, ygong}@sv.nec-labs.com

Abstract

In this paper we aim to train deep neural networks for rapid visual recognition. The task is highly challenging, largely due to the lack of a meaningful regularizer on the functions realized by the networks. We propose a novel regularization method that takes advantage of kernel methods, where a given kernel function represents prior knowledge about the recognition task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate encouraging results on a wide range of recognition tasks, in terms of both accuracy and speed.

1 Introduction

Visual recognition remains a challenging task for machines. This difficulty stems from the large pattern variations under which a recognition system must operate. The task is extremely easy for a human, largely due to the expressive deep architecture employed by human visual cortex systems. Deep neural networks (DNNs) are argued to have a greater capacity than shallow models to recognize a larger variety of visual patterns, partly because such architectures are considered more biologically plausible. However, training deep architectures is difficult because the large number of parameters to be tuned necessitates an enormous amount of labeled training data that is often unavailable.

Several authors have recently proposed training methods that use unlabeled data. These methods perform a greedy layer-wise pre-training using unlabeled data, followed by supervised fine-tuning [9, 3, 15]. Even though this strategy notably improves performance, to date the best reported recognition accuracy of deep models on popular benchmarks such as Caltech101 still falls well behind the results of shallow models. Besides using unlabeled data, in this paper we tackle the problem by leveraging additional prior knowledge. In the last few decades, researchers have developed successful kernel-based systems for a wide range of visual recognition tasks. Those sensibly-designed kernel functions provide an extremely valuable source of prior knowledge, which we believe should be exploited in deep learning.

In this paper, we propose an informative kernel-based regularizer, which makes it possible to train DNNs with prior knowledge about the recognition task. Computationally, we propose to solve the learning problem using stochastic gradient descent (SGD), as it is the de facto method for neural network training. To this end we transform the kernel regularizer into a loss function represented as a sum of costs over individual examples. This results in a simple multi-task architecture where a number of extra nodes at the output layer are added to fit a set of auxiliary functions automatically constructed from the kernel function. We apply the described method to train convolutional neural networks (CNNs) for a wide range of visual recognition tasks, including handwritten digit recognition, gender classification, ethnic origin recognition, and object recognition. Overall our approach exhibits excellent accuracy and speed on all of these tasks. Our results show that incorporating prior knowledge can boost the performance of CNNs by a large margin when the training set is small or the learning problem is difficult.
2 DNNs with Kernel Regularization

In our setting, the learning model, a deep neural network (DNN), aims to learn a predictive function f : X → R that achieves a low expected discrepancy E[ℓ(y, f(x))] over the distribution p(x, y). In the simplest case Y = {−1, 1} and ℓ(·, ·) is a differentiable hinge loss. Based on a set of labeled examples [(x_i, y_i)]_{i=1}^n, learning proceeds by minimizing a regularized loss

    L(\theta, \beta) = \sum_{i=1}^{n} \ell\big(y_i,\ \beta_1^\top \phi_i + \beta_0\big) + \frac{\gamma}{2}\,\|\beta_1\|^2    (1)

where φ_i = φ(x_i; θ) maps x_i to q-dimensional hidden units via a nonlinear deep architecture with parameters θ, including the connection weights and biases of all the intermediate layers; β = {β_1, β_0}, where β_1 collects the parameters of the transformation from the last hidden layer to the output layer and β_0 is a bias term; γ > 0; and ‖a‖² = tr(a^⊤ a) is the usual weight-decay regularization. Applying the well-known representer theorem, we derive the equivalence to a kernel system (in this paper we slightly abuse notation and use L to denote different loss functions; their meanings are uniquely identified by their arguments):

    L(\alpha, \alpha_0, \theta) = \sum_{i=1}^{n} \ell\Big(y_i,\ \sum_{j=1}^{n} \alpha_j K_{i,j} + \alpha_0\Big) + \frac{\gamma}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j K_{i,j}    (2)

where the kernel is computed by K_{i,j} = ⟨φ(x_i; θ), φ(x_j; θ)⟩ = φ_i^⊤ φ_j.

We assume the network is provided with some prior knowledge, in the form of an m × m kernel matrix Σ, computed on the n labeled training data, plus possibly additional m − n unlabeled data if m > n. We exploit this prior knowledge by imposing a kernel regularization on K(θ) = [K_{i,j}]_{i,j=1}^m, so that the learning problem becomes

Problem 2.1.

    \min_{\theta, \beta}\ L(\theta, \beta) + \lambda\,\Omega(\theta),    (3)

where λ > 0 and Ω(θ) is defined by

    \Omega(\theta) = \mathrm{tr}\big[K(\theta)^{-1}\Sigma\big] + \log\det\big[K(\theta)\big].    (4)

This is a case of semi-supervised learning if m > n. Though Ω is non-convex w.r.t. K, it has a unique minimum at K = Σ when Σ is positive definite, suggesting that minimizing Ω(θ) encourages K to approach Σ. The regularization can be explained from an information-theoretic perspective. Let p(f | K) and p(f | Σ) be the two Gaussian distributions N(0, K) and N(0, Σ). (From a Gaussian process point of view, a kernel function defines the prior distribution of a function f, such that the marginal distribution of the function values f on any finite set of inputs is a multivariate Gaussian.) Ω(θ) is related to the KL-divergence D_KL[p(f | Σ) ‖ p(f | K)]; therefore, minimizing Ω(θ) forces the two distributions to be close. We note that the regularization does not require Σ to be positive definite -- it can be positive semidefinite.

3 Kernel Regularization via Stochastic Gradient Descent

The learning problem in Eq. (3) can be solved by gradient-based methods. In this paper we emphasize large-scale optimization using stochastic gradient descent (SGD), because the method is fast when the total data size m is large, and backpropagation, a typical SGD method, has been the de facto way to train neural networks for large-scale learning tasks.

SGD addresses problems where the optimization cost is a sum of local costs of individual training examples. A standard batch gradient descent updates the model parameters using the true gradient summed over the whole training set, whereas SGD approximates the true gradient by the gradient contributed by a single random training example. The parameters of the model are therefore updated after each training example. For large data sets, SGD is often much faster than batch gradient descent. However, because the regularization term defined by Eq. (4) does not consist of a cost function that can be expressed as a sum (or an average) over data examples, SGD is not directly applicable. Our idea is to transform the problem into an equivalent formulation that can be optimized stochastically.
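Before describing the transformation, the short NumPy sketch below illustrates the behavior of the regularizer in Eq. (4) on a toy example: for a fixed prior kernel Σ, it is smallest when the network-induced kernel K equals Σ. The matrices, sizes, and random seed are invented purely for illustration and are not taken from the paper's experiments.

```python
import numpy as np

# Toy check of Eq. (4): Omega(K) = tr(K^{-1} Sigma) + log det K is minimized at K = Sigma.
rng = np.random.default_rng(0)

def omega(K, Sigma):
    return np.trace(np.linalg.solve(K, Sigma)) + np.linalg.slogdet(K)[1]

m = 5
A = rng.standard_normal((m, m))
Sigma = A @ A.T + 0.1 * np.eye(m)      # a positive definite "prior" kernel
B = rng.standard_normal((m, m))
K_other = B @ B.T + 0.1 * np.eye(m)    # some other kernel realized by the network

print(omega(Sigma, Sigma))             # equals m + log det Sigma (the minimum value)
print(omega(K_other, Sigma))           # strictly larger whenever K != Sigma
```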
3.1 Shrinkage on the Kernel Matrix

We consider a large-scale problem where the data size m may grow over time, while the size (q) of the last hidden layer of the DNN is fixed. The computed kernel K can therefore be rank-deficient. In order to ensure that the trace term in Ω(θ) is well-defined, and that the log-determinant term is bounded from below, we replace K by K + δI in Ω(θ), where δ > 0 is a small shrinkage parameter and I is an identity matrix. The log-determinant then acts on a much smaller q × q matrix (hereafter, with a slight abuse of notation, we use "const" in equations to summarize the terms irrelevant to the variables of interest):

    \log\det(K + \delta I) = \log\det\big(\Phi^\top\Phi + \delta I\big) + \text{const}

where Φ = [φ_1, ..., φ_m]^⊤ and const = (m − q) · log δ. Omitting all the irrelevant constants, we then turn the kernel regularization into

    \Omega(\theta) = \mathrm{tr}\big[(\Phi\Phi^\top + \delta I)^{-1}\Sigma\big] + \log\det\big(\Phi^\top\Phi + \delta I\big).    (5)

The kernel shrinkage not only remedies the ill-posedness, but also yields other conveniences in our later development.

3.2 Transformation of the Log-determinant Term

Noticing that Φ^⊤Φ = Σ_{i=1}^m φ_i φ_i^⊤ is a sum of quantities over data examples, we move it outside of the log-determinant for the convenience of SGD.

Theorem 3.1. Consider min_θ {L(θ) = h(θ) + g(a)}, where g(·) is concave and a ≡ a(θ) is a function of θ. If a local minimum w.r.t. θ exists, then the problem is equivalent to

    \min_{\theta, \Lambda}\ \tilde{L}(\theta, \Lambda) = h(\theta) + \langle a(\theta), \Lambda\rangle - g^{*}(\Lambda)    (6)

where g*(Λ) is the conjugate function of g(a), i.e., g*(Λ) = min_a {⟨a, Λ⟩ − g(a)}. (If g(a) is convex, its conjugate function is instead g*(Λ) = max_a {⟨a, Λ⟩ − g(a)}.)

Proof. For a concave function g(a), the conjugate function of its conjugate function is itself, i.e., g(a) = min_Λ {⟨a, Λ⟩ − g*(Λ)}. Since g*(Λ) is concave, ⟨a, Λ⟩ − g*(Λ) is convex w.r.t. Λ and attains the unique minimum g(a). Therefore minimizing L̃(θ, Λ) w.r.t. θ and Λ is equivalent to minimizing L(θ) w.r.t. θ.

Since the log-determinant is concave over q × q positive definite matrices A, the conjugate function of log det(A) is log det(Λ) + q. We can use the above theorem to transform any loss function containing log det(A) into another loss, which is an upper bound and involves A only through a linear term. Therefore the log-determinant in Eq. (5) is turned into a variational representation

    \log\det\big(\Phi^\top\Phi + \delta I\big) = \min_{\Lambda \in S_{+}^{q}} \Big\{ \sum_{i=1}^{m} \phi_i^\top \Lambda\,\phi_i + \delta\cdot\mathrm{tr}(\Lambda) - \log\det(\Lambda) \Big\} + \text{const}

where S_+^q denotes the set of q × q positive definite matrices, and const = −q. As we can see, the upper bound is a convex function of the auxiliary variable Λ and, more importantly, it amounts to a sum of local quantities contributed by each of the m data examples.

3.3 Transformation of the Trace Term

We assume that the kernel matrix Σ is presented in a decomposed form Σ = U U^⊤, with U = [u_1, ..., u_m]^⊤, u_i ∈ R^p, and p ≤ m. We have found that the trace term can be cast as a variational problem by introducing a q × p auxiliary variable matrix Ψ.

Proposition 3.1. The trace term in Eq. (5) is equivalent to the convex variational representation

    \mathrm{tr}\big[(\Phi\Phi^\top + \delta I)^{-1}\Sigma\big] = \min_{\Psi \in \mathbb{R}^{q\times p}}\ \frac{1}{\delta}\Big[ \sum_{i=1}^{m} \|u_i - \Psi^\top \phi_i\|^2 + \delta\,\mathrm{tr}(\Psi^\top\Psi) \Big].

Proof. We first obtain the analytical solution Ψ* = (Φ^⊤Φ + δI)^{-1} Φ^⊤ U, at which the variational representation reaches its unique minimum. Plugging it back into the objective, we have

    \frac{1}{\delta}\,\mathrm{tr}\Big[ U^\top U - 2\,U^\top\Phi\Psi^{*} + \Psi^{*\top}(\Phi^\top\Phi + \delta I)\Psi^{*} \Big]
    = \frac{1}{\delta}\,\mathrm{tr}\Big[ U^\top U - U^\top\Phi(\Phi^\top\Phi + \delta I)^{-1}\Phi^\top U \Big]
    = \mathrm{tr}\Big[ (\Phi\Phi^\top + \delta I)^{-1} U U^\top \Big]

where the last step follows from the Woodbury matrix identity.

Again, we note that the upper bound is a convex function of Ψ, and consists of a sum of local costs over data examples.
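The identities above are easy to check numerically. The sketch below uses random toy matrices (the sizes and the value of δ are arbitrary choices, not taken from the paper) to confirm the q × q log-determinant reduction of Section 3.1 and the variational form of Proposition 3.1.

```python
import numpy as np

# Numerical check of the log-det reduction (Section 3.1) and Proposition 3.1 (toy sizes).
rng = np.random.default_rng(0)
m, q, p, delta = 200, 10, 5, 1e-2

Phi = rng.standard_normal((m, q))   # rows are last-hidden-layer features phi_i
U = rng.standard_normal((m, p))     # decomposition of the prior kernel, Sigma = U U^T

# log det(K + delta I) on the m x m kernel K = Phi Phi^T reduces to a q x q log-det plus a constant.
K = Phi @ Phi.T
lhs = np.linalg.slogdet(K + delta * np.eye(m))[1]
rhs = np.linalg.slogdet(Phi.T @ Phi + delta * np.eye(q))[1] + (m - q) * np.log(delta)
print(np.allclose(lhs, rhs))        # True

# Proposition 3.1: tr[(K + delta I)^{-1} Sigma] equals the ridge-regression cost at the
# optimal Psi* = (Phi^T Phi + delta I)^{-1} Phi^T U, scaled by 1/delta.
Sigma = U @ U.T
trace_term = np.trace(np.linalg.solve(K + delta * np.eye(m), Sigma))
Psi = np.linalg.solve(Phi.T @ Phi + delta * np.eye(q), Phi.T @ U)
variational = (np.sum((U - Phi @ Psi) ** 2) + delta * np.sum(Psi ** 2)) / delta
print(np.allclose(trace_term, variational))  # True
```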
3.4 An Equivalent Learning Framework

Combining the previous results, we obtain a convex upper bound for the kernel regularization Eq. (5), which amounts to a sum of costs over examples under some regularization:

    \tilde{L}(\theta, \Lambda, \Psi) = \sum_{i=1}^{m} \Big[ \frac{1}{\delta}\,\|u_i - \Psi^\top\phi_i\|^2 + \phi_i^\top\Lambda\,\phi_i \Big] + \mathrm{tr}(\Psi^\top\Psi) + \delta\cdot\mathrm{tr}(\Lambda) - \log\det(\Lambda)

where we omit all the terms irrelevant to θ, Λ and Ψ. L̃(θ, Λ, Ψ) is convex w.r.t. Λ and Ψ, and has the unique minimum Ω(θ); hence we can replace Ω(θ) by instead minimizing the upper bound, and formulate an equivalent learning problem

    \min_{\theta, \beta, \Lambda, \Psi}\ L(\theta, \beta, \Lambda, \Psi) = L(\theta, \beta) + \lambda\,\tilde{L}(\theta, \Lambda, \Psi).    (7)

Clearly this new optimization can be solved by SGD. When applying SGD, however, each step based on one example needs to compute the inverse of Λ. This can be computationally unaffordable when the dimensionality q is large (e.g., q > 1000) -- remember that the efficiency of SGD depends on each stochastic update being lightweight. Our next result shows that we can dramatically reduce this complexity from O(q³) to O(q).

Proposition 3.2. Eq. (5) is equivalent to the convex variational problem

    \Omega(\theta) = \min_{\Psi, \rho}\ \sum_{i=1}^{m} \Big[ \frac{1}{\delta}\,\|u_i - \Psi^\top\phi_i\|^2 + \sum_{k=1}^{q} \rho_k\,\phi_{i,k}^2 \Big] + \mathrm{tr}(\Psi^\top\Psi) + \delta\cdot e^\top\rho - \sum_{k=1}^{q} \log\rho_k    (8)

where ρ = [ρ_1, ..., ρ_q]^⊤ with ρ_k > 0, and e = [1, ..., 1]^⊤.

Proof. There is an ambiguity of the solutions up to rotations: if {θ*, β*, Λ*, Ψ*} is an optimal solution set, then transforming φ_i → Rφ_i (by rotating the last hidden layer of θ*), β_1 → Rβ_1, Ψ → RΨ, and Λ → RΛR^⊤ results in the same optimum whenever R^⊤R = I. Since there always exists an R that diagonalizes Λ, we can pre-restrict Λ to be a diagonal positive definite matrix Λ = diag[ρ_1, ..., ρ_q], which does not change our problem and gives rise to Eq. (8).

We note that the variational form is convex w.r.t. the auxiliary variables Ψ and ρ. Therefore we can formulate the whole learning problem as

Problem 3.1.

    \min_{\theta, \beta, \Psi, \rho}\ L(\theta, \beta, \Psi, \rho) = \frac{1}{n}\,L_1(\theta, \beta) + \frac{\lambda}{mn}\,L_2(\theta, \Psi) + \frac{\lambda}{mn}\,L_3(\theta, \rho)    (9)

where L_1(θ, β) is defined by Eq. (1), and

    L_2(\theta, \Psi) = \sum_{i=1}^{m} \frac{1}{\delta}\,\|u_i - \Psi^\top\phi_i\|^2 + \mathrm{tr}(\Psi^\top\Psi),
    L_3(\theta, \rho) = \sum_{i=1}^{m} \sum_{k=1}^{q} \rho_k\,\phi_{i,k}^2 + \delta\cdot e^\top\rho - \sum_{k=1}^{q} \log\rho_k.

To ensure that the estimator of θ and β is consistent, the effect of the regularization should vanish as n → ∞. We therefore intentionally normalize L_2(θ, Ψ) and L_3(θ, ρ) by 1/m, while the overall loss function is averaged over the n labeled examples.

The loss function in Eq. (9) is a summation of three loss functions: the main classification task L_1(θ, β), an auxiliary least-squares regression problem L_2(θ, Ψ), and an additional regularization term L_3(θ, ρ), which can also be interpreted as a least-squares problem. Since each of these loss functions amounts to a summation of local costs contributed by individual data examples, the whole learning problem can be conveniently implemented by SGD. In addition, due to the equivalence between minimizing L_3(θ, ρ) and minimizing Σ_{k=1}^q log(Σ_{i=1}^m φ_{i,k}² + δ) + const, the model encourages the latent representations φ_i to be low-rank [6]. The SGD procedure is described in Algorithm 1.

In practice, the kernel matrix Σ = U U^⊤ that represents domain knowledge can be obtained in three different ways: (i) in the easiest case, U is directly available as some hand-crafted features computed from the input data, which corresponds to a linear kernel function; (ii) U can be the result of some unsupervised learning (e.g., the self-taught learning [14] based on sparse coding), applied on a large set of unlabeled data; (iii) if a nonlinear kernel function is available, U can be obtained by applying incomplete Cholesky decomposition to the m × m kernel matrix Σ. In the third case, when m is so large that the matrix decomposition cannot be computed in main memory, we apply the Nyström method [19]: we first randomly sample m_1 examples, p < m_1 < m, such that the computed kernel matrix Σ_1 can be decomposed in memory. Let V D V^⊤ be the p-rank eigenvalue decomposition of Σ_1; then the p-rank decomposition of Σ can be approximated by Σ ≈ U U^⊤, with U = Σ_{:,1} V D^{−1/2}, where Σ_{:,1} is the m × m_1 kernel matrix between all m examples and the subset of size m_1.
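To make the per-example structure of Problem 3.1 concrete, the sketch below evaluates the three local costs for a single example, with Λ restricted to diag(ρ) as in Proposition 3.2. The function names, the squared-hinge surrogate, and the even split of the global terms (γ/2 ‖β_1‖², tr(Ψ^⊤Ψ), and δ e^⊤ρ − Σ_k log ρ_k) across examples are illustrative assumptions; they are chosen so that summing the local costs over all examples recovers L_1, L_2 and L_3 exactly, which is what the sampling scheme of Algorithm 1 below relies on.

```python
import numpy as np

# Per-example costs of Problem 3.1 (illustrative sketch; names are ours, not the paper's).
# phi: q-dim output of the last hidden layer phi(x_i; theta); u: i-th row of U (Sigma = U U^T);
# beta1, beta0: output-layer weights and bias; Psi: q x p auxiliary matrix;
# rho: q positive auxiliary variables (the diagonal of Lambda).

def cost_L1(y, phi, beta1, beta0, gamma, n):
    """ell(y, beta1^T phi + beta0) with a squared hinge (one differentiable choice),
    plus this example's 1/n share of the weight-decay term (gamma/2)||beta1||^2."""
    margin = y * (beta1 @ phi + beta0)
    return max(0.0, 1.0 - margin) ** 2 + 0.5 * gamma * (beta1 @ beta1) / n

def cost_L2(u, phi, Psi, delta, m):
    """(1/delta) ||u - Psi^T phi||^2 plus a 1/m share of tr(Psi^T Psi)."""
    r = u - Psi.T @ phi
    return (r @ r) / delta + np.sum(Psi ** 2) / m

def cost_L3(phi, rho, delta, m):
    """phi^T diag(rho) phi plus a 1/m share of delta * e^T rho - sum_k log rho_k."""
    return phi @ (rho * phi) + (delta * np.sum(rho) - np.sum(np.log(rho))) / m

# Summing cost_L2 / cost_L3 over all m examples (and cost_L1 over the n labeled ones)
# recovers L_2, L_3 and L_1 exactly, which is what allows Algorithm 1 to update on one
# sampled example at a time.
```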
Algorithm 1 Stochastic Gradient Descent
repeat
    Generate a number a from the uniform distribution on [0, 1]
    if a < n/(m + n) then
        Randomly pick a sample i ∈ {1, ..., n} for L_1, and update the parameters by
            [θ, β] ← [θ, β] − η ∂L_1(x_i, θ, β)/∂[θ, β]
    else
        Randomly pick a sample i ∈ {1, ..., m} for L_2 and L_3, and update the parameters by
            [θ, Ψ, ρ] ← [θ, Ψ, ρ] − (ηλ/m) ∂[L_2(x_i, θ, Ψ) + L_3(x_i, θ, ρ)]/∂[θ, Ψ, ρ]
    end if
until convergence

4 Visual Recognition by Deep Learning with Kernel Regularization

In the following, we apply the proposed strategy to train a class of deep models, convolutional neural networks (CNNs) [11], for a range of visual recognition tasks, including digit recognition on the MNIST dataset, gender and ethnicity classification on the FRGC face dataset, and object recognition on the Caltech101 dataset.

Table 1: Percentage error rates of handwritten digit recognition on MNIST

    Training Size            100     600    1000    3000   60000
    SVM (RBF)              22.73    8.53    6.58    3.91    1.41
    SVM (RBF, Nyström)     24.73    9.15    6.92    5.51    5.16
    SVM (Graph)             5.21    3.74    3.46    3.01    2.23
    SVM (Graph, Cholesky)   7.17    6.47    5.75    4.28    2.87
    CNN                    19.40    6.40    5.50    2.75    0.82
    kCNN (RBF)             14.49    3.85    3.40    1.88    0.73
    kCNN (Graph)            4.28    2.36    2.05    1.75    0.64
    CNN (Pretrain) [15]        -    3.21       -       -    0.64
    EmbedO CNN [18]        11.73    3.42    3.34    2.28       -
    EmbedI5 CNN [18]        7.75    3.82    2.73    1.83       -
    EmbedA1 CNN [18]        7.87    3.82    2.76    2.07       -

In each of these tasks, we choose a kernel function that has been reported to achieve state-of-the-art or otherwise good performance in the literature. We will see whether a kernel regularizer can improve the recognition accuracy of the deep models, and how it compares with a support vector machine (SVM) using exactly the same kernel.

Throughout all the experiments, "kCNN" denotes a CNN regularized by a nonlinear kernel, processed by either the Cholesky or the Nyström approximation, with parameters p = 600, m_1 = 5000, and m the size of the whole data set in each task. The obtained u_i are normalized to unit length. The shrinkage and weight-decay parameters δ and γ are fixed at 1. The remaining two hyperparameters are the learning rate η ∈ {10^{-3}, 10^{-4}, 10^{-5}} and the kernel regularization weight λ ∈ {10^2, 10^3, 10^4, 10^5}; their values are set once for each of the 4 recognition tasks based on 5-fold cross-validation using 500 labeled examples.

4.1 Handwritten Digit Recognition on MNIST Dataset

The data contain a training set of 60000 examples and a test set of 10000 examples. The CNN employs 50 filters of size 7 × 7 on 34 × 34 input images, followed by down-sampling by 1/2, then 128 filters of size 5 × 5, followed by down-sampling by 1/2, and then 200 filters of size 5 × 5, giving rise to 200-dimensional features that are fed to the output layer. Two nonlinear kernels are used: (1) an RBF kernel, and (2) a graph kernel on a 10-nearest-neighbor graph [5]. SVMs with the RBF kernel have reported very good results on MNIST [11], while the graph kernel has shown excellent performance on the USPS digit data [5]. We perform a 600-dimensional Cholesky decomposition on the whole 70000 × 70000 graph kernel, which is feasible because the kernel matrix is very sparse.
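For concreteness, the sketch below writes out the layer sizes just described, together with the extra output nodes that fit the auxiliary targets u_i (here p = 600, as in the kCNN setting above). It uses PyTorch purely for illustration; the original implementation is not PyTorch, and the choices of tanh activations and average pooling for the 1/2 down-sampling are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the MNIST architecture of Section 4.1 with the multi-task output
# layer: the main head produces class scores for L_1, the auxiliary head computes
# Psi^T phi(x) for the regression cost L_2. Activation and pooling choices are assumptions.
class KernelRegularizedCNN(nn.Module):
    def __init__(self, num_classes=10, p=600):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 50, kernel_size=7),     # 34x34 -> 28x28, 50 maps
            nn.Tanh(),
            nn.AvgPool2d(2),                     # down-sampling by 1/2 -> 14x14
            nn.Conv2d(50, 128, kernel_size=5),   # 14x14 -> 10x10, 128 maps
            nn.Tanh(),
            nn.AvgPool2d(2),                     # down-sampling by 1/2 -> 5x5
            nn.Conv2d(128, 200, kernel_size=5),  # 5x5 -> 1x1, i.e. a 200-dim feature phi(x)
            nn.Tanh(),
            nn.Flatten(),
        )
        self.main_head = nn.Linear(200, num_classes)   # beta_1, beta_0 of L_1
        self.aux_head = nn.Linear(200, p, bias=False)  # Psi^T phi, fitted to u_i in L_2

    def forward(self, x):
        phi = self.features(x)   # last-hidden-layer representation phi(x; theta)
        return self.main_head(phi), self.aux_head(phi)

# Example forward pass on a batch of 34x34 images:
# scores, aux = KernelRegularizedCNN()(torch.randn(8, 1, 34, 34))  # shapes (8, 10), (8, 600)
```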
In addition to using the whole training set, we train the models on 100, 600, 1000 and 3000 random examples from the training set and evaluate the classifiers on the whole test set, repeating each setting 5 times independently. The results are given in Tab. 1. kCNNs effectively improve over CNNs by leveraging the prior knowledge, and also outperform SVMs that use the same kernels. The results are competitive with the state-of-the-art results of [15] and [18], which use different architectures.

4.2 Gender and Ethnicity Recognition on FRGC Dataset

The FRGC 2.0 dataset [13] contains 14714 face images of 568 individuals, taken under various lighting conditions and backgrounds. Besides person identities, each image is annotated with gender and ethnicity, which we group into 3 classes: "white", "asian", and "other". We fix 114 persons' 3014 images (randomly chosen) as the test set, and randomly select 5%, 10%, 20%, 50%, and "All" of the remaining 454 individuals' 11700 images for training. For each training size, we randomize the training data 5 times and report the average error rates.

In this experiment, CNNs operate on images represented by R/G/B planes plus horizontal and vertical gradient maps of the gray intensities. The 5 input planes of size 140 × 140 are processed by 16 convolution filters of size 16 × 16, followed by max pooling within each disjoint 5 × 5 neighborhood. The obtained 16 feature maps of size 25 × 25 are connected to the next layer by 256 filters of size 6 × 6, with 50% random sparse connections, followed by max pooling within each 5 × 5 neighborhood. The resulting 256 × 4 × 4 features are fed to the output layer. The nonlinear kernel used in this experiment is the RBF kernel computed directly on images, which has demonstrated state-of-the-art accuracy for gender recognition [2]. The results shown in Tab. 2 and Tab. 3 demonstrate that kCNNs significantly boost the recognition accuracy of CNNs for both gender and ethnicity recognition. The difference is most prominent when only a small training set is available.

Table 2: Percentage error rates of gender recognition on FRGC

    Training Size          5%     10%    20%    50%    All
    SVM (RBF)            16.7    13.4   11.3    9.1    8.6
    SVM (RBF, Nyström)   20.2    14.3   11.6    9.1    8.8
    CNN                  61.5    17.2    8.4    6.6    5.9
    kCNN                 17.1     7.2    5.8    5.0    4.4

Table 3: Percentage error rates of ethnicity recognition on FRGC

    Training Size          5%     10%    20%    50%    All
    SVM (RBF)            22.9    16.9   14.1   11.3   10.2
    SVM (RBF, Nyström)   24.7    20.6   15.8   11.9   11.1
    CNN                  30.0    13.9   10.0    8.2    6.3
    kCNN                 15.6     8.7    7.3    6.2    5.8

4.3 Object Recognition on Caltech101 Dataset

Caltech101 [7] contains 9144 images from 101 object categories and a background category. It is considered one of the most diverse object databases available today, and is probably the most popular benchmark for object recognition. We follow the common setting and train on 15 and 30 images per class, testing on the rest. Following [10], we limit the number of test images to 30 per class. The recognition accuracy is normalized by class sizes and evaluated over 5 random data splits. The CNN has the same architecture as the one used in the FRGC experiment. The nonlinear kernel is the spatial pyramid matching (SPM) kernel developed in [10]. Tab. 4 shows our results together with those reported in [12, 15] using deep hierarchical architectures. For CNNs, this task is much more challenging than the previous three, because in each category the data size is very small while the visual patterns are highly diverse.
Thanks to the regularization by the SPM kernel, kCNN dramatically improves the accuracy of CNN, and outperforms the SVM using the same kernel. This is perhaps the best performance reported by (trainable or hand-crafted) deep hierarchical models on the Caltech101 dataset. Some filters trained with and without kernel regularization are visualized in Fig. 1, which helps to understand the difference made by kCNN.

Table 4: Percentage accuracy on Caltech101

    Training Size          15      30
    SVM (SPM) [10]       54.0    64.6
    SVM (SPM, Nyström)   52.1    63.1
    HMAX [12]            51.0    56.0
    CNN (Pretrain) [15]     -    54.0
    CNN                  26.5    43.6
    kCNN                 59.2    67.4

Figure 1: First-layer filters on the B channel, learned from Caltech101 (30 examples per class). (a) CNN-Caltech101; (b) kCNN-Caltech101.

5 Related Work, Discussion, and Conclusion

Recent work on deep visual recognition models includes [17, 12, 15]. In [17] and [12] the first layer consists of hard-wired Gabor filters; a large number of patches are then sampled from the second layer and used as the basis of the representation, which in turn is used to train a discriminative classifier. Deep models are powerful in representing complex functions but very difficult to train. Hinton and coworkers proposed training deep belief networks with layer-wise unsupervised pre-training, followed by supervised fine-tuning [9]. The strategy was subsequently studied for other deep models like CNNs [15] and autoassociators [3], and for document coding [16]. In recent work [18], the authors proposed training a deep model jointly with an unsupervised embedding task, which also leads to improved results. Though it uses unlabeled data too, our work differs in its emphasis on leveraging prior knowledge, which suggests that it can be combined with those approaches, including neighborhood component analysis [8], to further enhance deep learning. This work is also related to transfer learning [1], which uses auxiliary tasks to learn a linear feature mapping. The work here is motivated differently and deals with more complex nonlinear functions.

One may ask: why bother training a kCNN, instead of simply combining two independently trained CNN and SVM systems? The reason is computational speed -- kCNN pays an extra cost to exploit a kernel matrix in the training phase, but in the prediction phase the system uses the CNN alone. In our Caltech101 experiment, the SVM (SPM) needs 6 seconds to process a new image on a PC with a 3.0 GHz processor, while in the same amount of time kCNN can process about 240 images. The latest record on Caltech101 was obtained by combining multiple kernels [4]. We conjecture that kCNN could be further improved by using multiple kernels without sacrificing recognition speed.

Conclusion: We proposed using kernels to improve the training of deep models. The approach is implemented with stochastic gradient descent, and demonstrates excellent performance on a range of visual recognition tasks. Our experiments showed that prior knowledge significantly improves the performance of CNNs when insufficient labeled data are available for hard recognition problems. The trained model is much faster than kernel systems at making predictions.

References

[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 2005.
[2] S. Baluja and H. Rowley. Boosting sex identification performance. International Journal of Computer Vision, 2007.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), 2007.
[4] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using ROIs and multiple kernel learning. Submitted to International Journal of Computer Vision, 2008.
[5] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 15, 2003.
[6] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In American Control Conference, 2003.
[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, 2004.
[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, 2005.
[9] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, July 2006.
[10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[12] J. Mutch and D. G. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.
[13] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, and W. Worek. Preliminary Face Recognition Grand Challenge results. In IEEE Conference on Automatic Face and Gesture Recognition, 2006.
[14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the International Conference on Machine Learning (ICML), 2007.
[15] M. Ranzato, F.-J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
[16] M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
[17] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, 2005.
[18] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
[19] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.