Deep networks for robust visual recognition

Yichuan Tang y3tang@uwaterloo.ca Chris Eliasmith celiasmith@uwaterloo.ca Centre for Theoretical Neuroscience, University of Waterloo, Waterloo ON N2L 3G1 CANADA

Abstract
Deep Belief Networks (DBNs) are hierarchical generative models that have been used successfully to model high dimensional visual data. However, they are not robust to common variations such as occlusion and random noise. We explore two strategies for improving the robustness of DBNs. First, we show that a DBN with sparse connections in the first layer is more robust to variations that are not in the training set. Second, we develop a probabilistic denoising algorithm to determine a subset of the hidden layer nodes to unclamp. We show that this can be applied to any feedforward network classifier with localized first layer connections. Recognition results after denoising are significantly better than those of standard DBN implementations for various sources of noise.

1. Introduction
Deep Belief Networks (DBNs) are hierarchical generative models with many latent variables that effectively model high dimensional visual image data (Hinton et al., 2006). A DBN is trained by a greedy layer-by-layer unsupervised learning algorithm on a series of bipartite Markov Random Fields (MRFs) known as Restricted Boltzmann Machines (RBMs). Fine tuning by the up-down algorithm or discriminative optimization results in a deep network capable of fast feedforward classification (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009). DBNs model all the pixels in the visible layer probabilistically and, as a result, are not robust to images containing "noise" that is not in the training set. We include occlusions, additive noise, and "salt" noise in our definition of noise in this paper.

To improve the robustness of the DBN, we introduce a modified version of the DBN termed a sparse DBN (sDBN), where the first layer is sparsely (and locally) connected. This is in part inspired by the properties of the human visual system. It is well established that the lower cortical levels represent the visual input in a local, sparsely connected topographical manner (Hubel & Wiesel, 1959). We show that a sDBN is more robust to noise on the MNIST (LeCun et al., 1998) dataset with noise added to the test images. We then present a denoising algorithm which combines top-down and bottom-up inputs to "fill in" the subset of hidden layer nodes which are most affected by noise. Lee & Mumford (2003) proposed that the human visual cortex performs hierarchical Bayesian inference where "beliefs" are propagated up and down the hierarchy. Our attention-esque top-down feedback can be thought of as a type of "belief" that helps to identify object versus non-object (noise) elements in the visible layer.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

2. Related Work
Sparsely connected weights have been widely used in visual recognition algorithms (Fukushima, 1983; LeCun et al., 1998; Serre et al., 2005). Most of these algorithms contain a max-pooling stage following a convolutional stage to provide a certain amount of translational and scale invariance. Recently there has been work combining the convolutional approach with the DBN (Lee et al., 2009; Norouzi et al., 2009). These efforts enforce sparse connections similar in spirit to those enforced here. However, unlike those methods, our main motivation is not to provide translational invariance and/or to reduce the number of model parameters, but rather to diminish the effect of noise on the activations of hidden layer nodes. In addition, our algorithm does not require weight sharing (applying the same filter across an image), which would increase the total number of hidden layer nodes and increase the computational complexity of our denoising algorithm.

Welling et al. (2002) and Roth & Black (2005) also learned MRFs to model the prior statistics of images for denoising and inpainting. Whereas those methods model at the pixel level and explicitly specify a noise likelihood, our proposed algorithm uses the prior over the first hidden layer to estimate the subset of nodes which are affected by noise. This allows the method to be agnostic about the noise likelihood distribution.

Finally, we evaluate our methods on the widely used MNIST handwritten digit classification task, where the state-of-the-art error rate is currently 0.53% (Jarrett et al., 2009) for domain knowledge based methods and 0.95% (Salakhutdinov & Hinton, 2009) for permutation invariant methods.

3. Sparsely Connected DBN
While the first layer weights of a standard DBN are somewhat spatially localized, they are not forced to be zero past a given radius. Consequently, the small but significant weight values affect a given hidden node's activation if any noise is present anywhere in the image. Classification results are likewise affected, making DBNs not robust to various types of noise. For instance, figure 1 gives examples of noisy images and the respective classification errors of a DBN. This DBN was trained according to (Hinton et al., 2006), followed by 30 epochs of discriminative optimization, and achieves 1.03% test error on the clean images. However, the error increases dramatically under various types of noise. These particular kinds of noise were chosen to reflect various possible sources of error to which biological visual systems are robust. The first is the simple introduction of a border that does not overlap with the foreground of the digits. The second is occlusion by a rectangular block of random size and location. The third is the corruption of the images by random noise.

Figure 1. A DBN fails to be robust to various types of noise. The noise includes added borders, randomly placed occluding rectangles, and random pixels toggled to white.

The poor classification performance on the noisy test images is expected since a DBN models the joint probability of 28x28 = 784 pixels, all with black borders. Therefore, test images with white borders are not probable and the ensuing classification is not very accurate. Of course, when images with these variations were added to the training set, we obtained better recognition results. However, because it is impractical to add all possible noise that might exist in a real world environment, it is desirable to have a DBN which is more robust to out-of-sample test data before resorting to enlarging the training set.

3.1. Why Sparseness
In this paper we use $V$, $H^1$, $H^2$, $H^3$ to refer to each of the layers (see figure 3), and $V = v$ to denote a specific activation of layer $V$. We will also use $q(\cdot)$ to refer to the approximate posterior computed by the recognition weights. Specifically, $q(h^1 \mid v) = \sigma((W^1_{rec})^T v + c)$, where $\sigma(\cdot)$ is the logistic function.

We improve the robustness of the DBN by first reducing the effect that a noisy image $V = \tilde{v}$ has on the hidden layer activation $q(h^1 \mid \tilde{v})$. We accomplish this by specifying sparse connections between each hidden layer node and a spatially local group of visible layer nodes in the first RBM. We use sRBM to refer to this even more restricted type of RBM. For example, each $h^1$ node is randomly assigned a 7x7 receptive field (RF), and it has no connections to visible nodes outside of its RF. With local connections, noise or occlusion in one subset of $V$ nodes will only affect a subset of $H^1$ nodes. The main motivation here is to reduce the change between the $H^1$ activation given the noisy image, $q(h^1 \mid \tilde{v})$, and the $H^1$ activation given the original image, $q(h^1 \mid v)$.

3.2. Sparse RBM Learning
The basic building block of a DBN is the RBM. A full account of RBM training and DBN formation is given in (Hinton et al., 2006; Bengio et al., 2007). An RBM with visible layer nodes $V = v$ and hidden layer nodes $H = h$ is defined by an energy function

$$ E(v, h; \theta) = -b^T v - c^T h - v^T W h \tag{1} $$

where $\theta = \{W, b, c\}$ are the model parameters. The probability distribution of the system $\{v, h\}$ can be written as:

$$ p(v, h) = \frac{p^*(v, h)}{Z(\theta)} = \frac{\exp(-E(v, h))}{Z(\theta)} \tag{2} $$

where $Z(\theta)$ is the normalization constant: $Z(\theta) = \sum_{v,h} \exp(-E(v, h))$. Exact maximum likelihood learning is intractable due to the computation of an expectation w.r.t. the model's distribution. In practice, learning is often performed using n-step Contrastive Divergence (CD) (Hinton, 2002), where the weights are updated as:

$$ \Delta W_{ij} \propto E_{data}[v_i h_j] - E_{recons}[v_i h_j] \tag{3} $$

$E_{recons}[\cdot]$ represents the expectation w.r.t. the distribution after n steps of block Gibbs sampling starting at the data.$^1$ When learning a sRBM, the only modification needed is to zero out the weights connecting each hidden node to visible nodes that are outside of its RF:

$$ \Delta W_{ij} \propto (E_{data}[v_i h_j] - E_{recons}[v_i h_j]) \, \hat{W}_{ij} \tag{4} $$

where

$$ \hat{W}_{ij} = \begin{cases} 1 & \text{if } v_i \text{ is in } h_j\text{'s RF} \\ 0 & \text{otherwise} \end{cases} \tag{5} $$
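The masked update of eqs. 4 and 5 can be sketched in NumPy as follows. This is a minimal illustration rather than the training code used for the experiments: it uses a single Gibbs step (CD-1) for brevity, and the helper names `make_rf_mask` and `masked_cd1_step`, the uniform placement of each RF, and the bias updates are assumptions made for the example.

```python
import numpy as np

def make_rf_mask(n_hidden, img_side=28, rf_side=7, rng=None):
    """Binary mask (n_visible x n_hidden): entry (i, j) is 1 if pixel i is in hidden unit j's RF."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((img_side * img_side, n_hidden))
    for j in range(n_hidden):
        # Assumption: the RF's top-left corner is drawn uniformly so the RF fits in the image.
        r, c = rng.integers(0, img_side - rf_side + 1, size=2)
        patch = np.zeros((img_side, img_side))
        patch[r:r + rf_side, c:c + rf_side] = 1.0
        mask[:, j] = patch.ravel()
    return mask

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_cd1_step(v0, W, b, c, mask, lr=0.1, rng=None):
    """One CD-1 update on a batch v0 (batch x n_visible); the weight gradient is masked as in eq. 4."""
    rng = np.random.default_rng() if rng is None else rng
    ph0 = sigmoid(v0 @ W + c)                          # hidden probabilities given the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sampled hidden states
    pv1 = sigmoid(h0 @ W.T + b)                        # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]    # E_data[v h] - E_recons[v h]
    W = W + lr * grad * mask                           # weights outside each RF stay at zero
    b = b + lr * (v0 - pv1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

If the weights are initialized as `W = 0.01 * rng.standard_normal((784, n_hidden)) * mask`, every weight outside a receptive field starts at zero and remains exactly zero after any number of such updates.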

Table 1. Sparse RBM and sparse DBN evaluations

RF size   # hidden   log probability   sDBN error
7x7       500        -94.62            1.19%
7x7       1000       -92.50            1.20%
7x7       1500       -91.77            1.60%
10x10     500        -91.53            1.17%
10x10     1000       -90.16            1.24%
10x10     1500       -89.78            1.55%
12x12     500        -90.30            1.18%
12x12     1000       -89.72            1.16%
12x12     1500       -89.56            1.63%

Additional computational efficiency can be obtained by using sparse matrix operations for learning and inference.

3.3. Sparse RBM Evaluation
As the RF approaches 28x28 (the dimension of the visible layer for the MNIST digits), the sRBM approaches the standard RBM. Using a 7x7 RF instead of the standard 28x28 reduces the number of weights for the first layer RBM by a factor of 16 (784/49). One concern is whether this sRBM would still be a good generative model of the data. To find the average log probability of the test set, we estimated the normalization constant $Z(\theta)$ for each sRBM by using the Annealed Importance Sampling algorithm (Neal, 2001). Following (Salakhutdinov & Murray, 2008), we performed 100 annealing runs using around 15,000 intermediate distributions. Table 1 shows the estimated average test log probability for various sparse RBMs. It also shows the error rate of the DBNs built from these sparse RBMs (section 3.4). The log probability increases with both RF size and the number of hidden layer nodes. While not shown in this table, it is worth noting that the best 12x12 sRBM achieves a log probability that is only about 3 nats below that of an equivalently trained standard RBM. In addition, the worst sRBM considered here (7x7, 500 hidden nodes) is still about 11 nats better than a standard RBM trained using 3-step CD (Salakhutdinov & Murray, 2008). Figure 2 shows some of the filters learned by a 7x7 sRBM on MNIST.

1 In our experiments, we use 25-step CD for sRBM training, with a learning rate of 0.1 for 50 epochs.
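To make the quantity reported in Table 1 concrete, the following sketch computes the unnormalized log probability of a visible vector under the energy of eq. 1 and turns it into an average test log probability once an estimate of the log normalization constant is available. Annealed Importance Sampling is not implemented here; as an assumption for illustration only, the normalization constant is obtained by brute-force enumeration, which is feasible only for toy-sized models. All function names are hypothetical.

```python
import itertools
import numpy as np

def log_p_star(v, W, b, c):
    """Unnormalized log p*(v) = b^T v + sum_j log(1 + exp(c_j + v^T W[:, j])), from eq. 1."""
    return v @ b + np.logaddexp(0.0, v @ W + c).sum(axis=-1)

def log_Z_bruteforce(W, b, c):
    """Exact log Z by enumerating every binary visible state (toy models only)."""
    n_vis = W.shape[0]
    states = np.array(list(itertools.product([0.0, 1.0], repeat=n_vis)))
    return np.logaddexp.reduce(log_p_star(states, W, b, c))

def avg_test_log_prob(V_test, W, b, c, log_Z):
    """Average log p(v) over the rows of V_test, given an estimate of log Z."""
    return float(np.mean(log_p_star(V_test, W, b, c) - log_Z))
```

For a real sRBM with 784 visible nodes, `log_Z` would be replaced by the AIS estimate; the rest of the computation is unchanged.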

Figure 2. Filters from a sRBM with 7x7 RF learned on MNIST.

3.4. Sparse DBN
A DBN can be constructed with a hierarchical series of RBMs. We train our 2nd level RBM in the standard way and allow for full connectivity between layers $H^1$ and $H^2$. A greedy layer-wise training procedure is used where $q(h^1 \mid v)$ is treated as the visible layer data for the 2nd level RBM. A sDBN is then formed by stacking together the RBMs and fine tuning the entire network using the up-down algorithm (Hinton et al., 2006). Alternatively, we can convert the sDBN into a deterministic feedforward network and minimize the cross-entropy error (Bengio et al., 2007). An example of such a network is shown in figure 3, where the recognition weights $W^{1,2,3,4}_{rec}$ form a feedforward classifier. Layer $Z$, $W^2_{denoise}$, and $W^1_{gen}$ are part of the denoising process described in section 4.

Specifically, the sDBN in our experiments has the same size and depth as the DBN in (Hinton et al., 2006), but its first layer is sparse with 7x7 RFs. It is fine tuned using the up-down algorithm for 300 epochs before being discriminatively optimized for 30 more epochs.$^2$ Figure 4 shows the recognition errors of the sDBN on the noisy test set using only feedforward weights. Significant improvements can be seen for all types of noise.

2 We use the conjugate gradient method to minimize the cross-entropy error, with the training data divided into batches of 5K each.
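The deterministic feedforward network obtained from the stacked recognition weights can be sketched as below. This is an illustrative sketch under stated assumptions: a softmax output layer, weight matrices stored as (inputs x outputs) so that `a @ W` corresponds to $(W_{rec})^T a$ in the column-vector notation of the text, and placeholder layer sizes. The function names are hypothetical, and the conjugate gradient optimization itself is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(v, weights, biases):
    """Propagate a batch v through sigmoid hidden layers, then a softmax output layer."""
    a = v
    for W, c in zip(weights[:-1], biases[:-1]):
        a = sigmoid(a @ W + c)                            # q(h^l | h^(l-1))
    logits = a @ weights[-1] + biases[-1]
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log probability assigned to the correct class."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))
```

Discriminative fine tuning then amounts to minimizing `cross_entropy(feedforward(v, weights, biases), labels)` with respect to all weights and biases, starting from the values learned generatively.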


4. Probabilistic Denoising
When noise is present during recognition, the affected $H^1$ nodes increase the error rate. This is an out-of-sample problem, where an affected $q(h^1 \mid \tilde{v})$ has low probability as defined by the training set. Classification boundaries in regions of state space with low probability cannot be trusted due to the lack of training data in those regions. Therefore, we seek to reduce the error rates by denoising $h^1$ using a generative model of $q(h^1 \mid v)$.$^3$ We accomplish this by learning a separate denoising RBM that uses $q(h^1 \mid v)$ as its visible data.$^4$ $W^2_{denoise}$ are the weights of this new RBM, which has $Z$ (with 1000 nodes) as its hidden layer (figure 3). This RBM's energy function and the marginal of $h^1$ are

$$ E(h^1, z) = -d^T h^1 - e^T z - (h^1)^T W^2_{denoise} z \tag{6} $$

$$ p(h^1) = \frac{p^*(h^1)}{\sum_{h^1, z} \exp(-E(h^1, z))} = \frac{\sum_{z} \exp(-E(h^1, z))}{\sum_{h^1, z} \exp(-E(h^1, z))} \tag{7} $$

Note that $\log p^*(h^1)$ can be calculated analytically due to the bipartite nature of the RBM. We trained $W^2_{denoise}$ for 600 epochs by using 100 persistent Markov chains to estimate the model's expectations (Tieleman, 2008). This method is known as Persistent CD and (compared to CD) can learn a better model for a fixed amount of computation.

The idea of denoising before classification can be understood schematically as depicted in figure 5, which shows a plot of noisy images above their unnormalized log probability $\log p^*(h^1)$. Not surprisingly, highly noisy test images have much smaller $\log p^*(h^1)$ and would be farther away from regions of high density. The dashed arrows indicate how we would like to denoise a noisy image by moving it (not necessarily in one shot) to a region of state space with higher probability, putting it in a better region for classification.

Figure 3. A deep network for feedforward recognition with denoising. Upward arrows are feedforward recognition weights, the downward dashed arrow is the generative weight, and the bidirectional dashed arrow is the weight of the denoising RBM. $W^1_{gen}$ is part of the DBN and is used to calculate $p(v \mid h^1)$. If the network is not a DBN we can easily learn $W^1_{gen}$ to predict the data $v$ given $h^1$.

Figure 4. A sparse DBN is more robust to noise than the standard DBN, and only slightly worse on the clean images.

Figure 5. A hypothetical state space with the dark band being the region of high probability. See text for details.

3 While denoising can also be done at the $V$ layer, we prefer $H^1$ due to its more abstract representation of the input and smaller dimensionality.

4 When the sDBN is fine tuned as a generative model by the up-down algorithm, we would ideally want to denoise using the $p(h^1)$ defined by all the higher layers of the sDBN. However, we can only approximate a variational lower bound on $\log p^*(h^1)$ by drawing samples from $q(h^2 \mid h^1)$ (see (Salakhutdinov & Murray, 2008)). In contrast, a separate denoising RBM allows its model of $\log p^*(h^1)$ to be calculated exactly.

4.1. Denoising via Unclamping
If we know which of the nodes in $H^1$ are affected, we can denoise by "filling in" their values by sampling from the distribution conditioned on all other $H^1$ nodes in the denoising RBM. For example, in figure 6, the two leftmost nodes of $H^1$ are unclamped while the rest are clamped. Let us use $\psi_j \in \{0, 1\}$, with $\psi_j = 1$ denoting the unclamping of node $h^1_j$ and $\psi_j = 0$ the clamping of $h^1_j$. We run 50
iterations of block Gibbs sampling to sample the $H^1$ nodes using

$$ p(z_k = 1 \mid h^1) = \sigma\Big(\sum_j W^2_{jk} h^1_j + e_k\Big) \tag{8a} $$

$$ p(h^1_j = 1 \mid z) = \sigma\Big(\sum_k W^2_{jk} z_k + d_j\Big) \tag{8b} $$

where we only use eq. 8b to update the unclamped ($\psi_j = 1$) nodes. After denoising, we denote the $H^1$ activation as $g$. Figure 7 shows the denoised results $\hat{v} = \sigma(W^1_{gen} g + b)$ obtained with the above method when the noisy hidden nodes, or $\psi_j$, are explicitly specified. It is clear that if the noisy nodes are correctly identified, correct classification will be much easier.

Figure 6. The shaded nodes are clamped. Denoising is performed by running block Gibbs sampling on the unclamped nodes.

Figure 7. The first row shows occluded images, the second row shows the denoised results, and the third row shows the original images.

4.2. Determining Which Nodes to Unclamp
During recognition, a DBN does not know which $H^1$ nodes to unclamp. We present here an iterative denoising algorithm which uses the gradient of $\log p(h^1)$ of the RBM defined in eqs. 6 and 7 to determine which hidden layer nodes to unclamp. Denoting $h^1_0 = q(h^1 \mid \tilde{v})$ to be the initial $H^1$ activation at time step 0, we estimate $\psi_0$ and compute $g_0$. Setting $h^1_1 = g_0$, we repeat this process for several time steps. The discrete gradient of the log probability with respect to $\psi_j$ at time step $t$ is given as:

$$ \Delta_j \log p(h^1_t) = \log p^*(h^1_{\setminus j, t}) - \log p^*(h^1_t) \tag{9} $$

which is evaluated at $\psi = 0$. We denote $h^1_{\setminus j}$ to be the set of all nodes in $H^1$ except $h^1_j$. $h^1_{\setminus j, t}$ is $h^1_t$ with the $j$-th node replaced by $p(h^1_j = 1 \mid h^1_{\setminus j})$, which is given by

$$ p(h^1_j = 1 \mid h^1_{\setminus j}) = \frac{\exp(d_j) \prod_{k=1}^{N_z} \big(1 + \exp(\alpha_k + W^2_{jk})\big)}{\exp(d_j) \prod_{k=1}^{N_z} \big(1 + \exp(\alpha_k + W^2_{jk})\big) + \prod_{k=1}^{N_z} \big(1 + \exp(\alpha_k)\big)} \tag{10} $$

and

$$ \alpha = (W^2_{\setminus j})^T h^1_{\setminus j} + e \tag{11} $$

where $W^2_{\setminus j}$ is $W^2_{denoise}$ omitting the $j$-th row, $e$ is the bias to layer $Z$, $d$ is the bias to $H^1$, and $N_z$ is the number of nodes in layer $Z$. We can then estimate which of the $H^1$ nodes to unclamp by using a threshold

$$ \psi_{j,t} = \begin{cases} 1 & \text{if } \Delta_j \log p(h^1_t) > \epsilon(t) \\ 0 & \text{otherwise} \end{cases} \tag{12} $$

where $\epsilon(t)$ is a threshold that decreases with time. $\psi_{j,t}$ is then used in the calculation of $g_t$, as described in section 4.1. We update the hidden layer activations in the next time step to

$$ h^1_{t+1} \leftarrow g_t \tag{13} $$

The standard Bayesian approach to denoising is to specify a prior $p(h)$ and then to find the MAP estimate of $p(h \mid \tilde{h})$, where $\tilde{h}$ is the noisy $H^1$ activation. In contrast, we try to optimize $\log p(h)$ with respect to the parameters $\psi$. In our algorithm, unclamping node $h_j$ is similar to specifying the noise likelihood to be flat for node $j$: $p(\tilde{h}_j \mid h_j) \propto \text{constant}$; while clamping node $h_j$ is similar to specifying the Dirac delta for the noise likelihood of node $j$: $p(\tilde{h}_j \mid h_j) = \delta(\tilde{h}_j - h_j)$.

4.3. Combining with Visible Layer Inputs
Having obtained a denoised $g_t$, we can simply use $g_t$ as our $H^1$ activation and compute $q(h^2 \mid g_t)$, $q(h^3 \mid h^2)$, etc. all the way up to the output for classification. However, it is much better if we also take into account the bottom-up inputs from $V$. This idea comes naturally for the Deep Boltzmann Machine (DBM) (Salakhutdinov & Hinton, 2009), where, because $H^1$ has undirected connections from both $V$ and $H^2$, $p(h^1 \mid h^2, v)$ involves both $v$ and $h^2$. Since $V$ layer nodes contain noise, we do not want to use the unreliable bottom-up influences directly. Instead, we would like to attenuate the noise part of $V$ with an attention-like multiplicative feedback gating signal $u \in [0, 1]$. The attenuated bottom-up influence would be

$$ q(h^1 \mid v; u) = \sigma\big((W^1_{rec})^T (v \odot u) + c\big) \tag{14} $$

The multiplicative interaction $v \odot u$, where $\odot$ denotes element-wise multiplication, is partly inspired by visual neuroscience. Recently, there has been mounting neurophysiological evidence for considerable attentional modulation of early visual areas such as V1 (Posner & Gilbert, 1999; Buffalo et al., 2010). fMRI studies of human subjects performing recognition tasks with distractors have suggested that attentional modulation could be a delayed feedback to V1 from higher cortical areas (Martínez et al., 1999). Attention can also be stimulus dependent and has been shown to affect visual processing both spatially and feature-specifically (Treue & Martínez Trujillo, 1999). In one interesting study, Lamme (1995) showed that neurons in macaque V1 responded better to texture in the foreground than to similar textures in the background 30-40 ms after onset of activation. The nonlocal nature and temporal latency of the response differences strongly suggest feedback from higher visual areas. By using $u$ in our algorithm, we introduce a very simple method for dealing with noisy $V$ layer nodes. To compute $u$ we use$^5$

$$ u = 1 - |v - p(v \mid g; W^1_{gen})| \tag{15} $$

where $W^1_{gen}$ is the first layer's generative weights. To combine the modulated bottom-up input with the denoised activation $g$, we compute a weighted average based on the amount of noise in the RF of a hidden node

$$ g_{combined} = q(h^1 \mid v; u) \odot \frac{u^T \hat{W}}{\rho^2} + g \odot \Big(1 - \frac{u^T \hat{W}}{\rho^2}\Big) \tag{16} $$

where $\rho$ is the size of the RF and $\hat{W}$ is defined by eq. 5. To update our hidden layer activation in the next time step, we modify eq. 13 to be

$$ h^1_{t+1} \leftarrow g_{combined,t} \tag{17} $$

The entire training and inference process for the sDBN is summarized in Algorithm 1.

4.4. Denoising Results
In our experiments, we used 6 denoising iterations (t = 1 to t = 6) with $\epsilon(t)$ decaying linearly from 1.0 to 0.0. Results were similar for other $\epsilon(t)$ schedules and numbers of iterations. Figure 8 shows the intermediate denoising results. The combination of the top-down and bottom-up signals is vital to good results. Besides the aforementioned types of noise, we also experimented with pepper noise and occlusions by crossed lines.

4.5. Recognition Results
For recognition, we performed 10 iterations of denoising with $\epsilon(t)$ decaying from 2.0 to 0.2 for each test image. After denoising, we proceeded with the feedforward recognition by computing $q(h^2 \mid g_{10})$ and feeding forward to the output using the recognition weights. In table 2, we summarize the error rates on MNIST for all the networks. The 7x7+denoised line has the error rates found after denoising. Denoising provides a large improvement over the accuracy of the sDBN for noisy images. However, the denoising sDBN is slightly worse than the standard DBN on the clean images. This effect is hard to avoid since denoising seeks to increase the probability of $h^1$ defined over all 10 digits and may cross classification boundaries.

For comparison, we also trained a standard DBN and a 7x7 sDBN with noise added evenly to the 60K MNIST training set. They are fine tuned with 300 epochs of the up-down algorithm followed by 30 epochs of discriminative optimization. The results show that sparse connections are better for recognition in this case as well. It is also revealing that, in comparison to the denoising sDBN (trained only on clean images), the error rates of the networks trained with noise are lower only on the block occluded test images.

Algorithm 1 Sparse DBN Training and Inference
Learning:
  1: Learn sRBM using eq. 4.
  2: Greedy pretraining of higher layer RBMs and stack to form a sDBN.
  3: Fine tune using the up-down algorithm.
  4: Convert the sDBN into a discriminative classifier and minimize cross-entropy error.
  5: Learn $W^2_{denoise}$ using $q(h^1 \mid v)$ as input.
  6: Learn $W^1_{gen}$ by minimizing cross-entropy between the data and $p(v \mid q(h^1 \mid v))$.
Recognition:
  1: For noisy input $\tilde{v}$, compute $h^1_0 = q(h^1 \mid \tilde{v})$.
  for t = 1 to n do
    2: Estimate $\psi_t$ using eq. 12
    3: Gibbs sampling to obtain $g_t$ using eq. 8
    4: Combine with bottom-up input to obtain $g_{combined,t}$ using eq. 16
    5: $h^1_{t+1} \leftarrow g_{combined,t}$
  end for
  6: Compute $q(h^2 \mid h^1_{n+1})$, then feedforward to output.
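The recognition phase of Algorithm 1 (eqs. 8 to 17) can be sketched for a single image as follows. This is a minimal NumPy illustration under several assumptions that are not taken from the experiments: weight matrices are stored as `W_rec` and `W_gen` with shape (visible x hidden) and `W2` with shape (hidden x Z), the threshold decays linearly, the final Gibbs step reports conditional probabilities rather than samples for the unclamped nodes, and `rf_pixels` stands for the number of pixels in a receptive field (49 for 7x7). All function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.logaddexp(0.0, x)

def log_p_star(h, W2, d, e):
    """Unnormalized log probability of h under the denoising RBM (eqs. 6 and 7)."""
    return h @ d + softplus(h @ W2 + e).sum()

def select_unclamped(h, W2, d, e, eps):
    """Eqs. 9-12: unclamp node j if replacing h_j by p(h_j = 1 | other nodes) gains more than eps."""
    alpha = (h @ W2 + e)[None, :] - h[:, None] * W2       # row j holds eq. 11 for node j
    p = sigmoid(d + softplus(alpha + W2).sum(axis=1) - softplus(alpha).sum(axis=1))  # eq. 10
    log_p_repl = h @ d + d * (p - h) + softplus(alpha + p[:, None] * W2).sum(axis=1)
    delta = log_p_repl - log_p_star(h, W2, d, e)          # eq. 9
    return delta > eps                                    # eq. 12

def gibbs_fill_in(h, unclamped, W2, d, e, rng, n_gibbs=50):
    """Eq. 8: block Gibbs sampling in which only the unclamped nodes are resampled."""
    h = h.copy()
    ph = h
    for _ in range(n_gibbs):
        z = (rng.random(e.shape) < sigmoid(h @ W2 + e)).astype(float)   # eq. 8a
        ph = sigmoid(W2 @ z + d)                                        # eq. 8b
        h = np.where(unclamped, (rng.random(d.shape) < ph).astype(float), h)
    # Assumption: return probabilities, not binary samples, for the unclamped nodes.
    return np.where(unclamped, ph, h)

def denoise(v, W_rec, c, W_gen, b, W2, d, e, rf_mask, rf_pixels,
            n_iter=6, eps_start=1.0, eps_end=0.0, rng=None):
    """Iterative denoising of the first hidden layer activation for one image v."""
    rng = np.random.default_rng() if rng is None else rng
    h = sigmoid(v @ W_rec + c)                            # initial activation from the noisy image
    for eps in np.linspace(eps_start, eps_end, n_iter):
        unclamped = select_unclamped(h, W2, d, e, eps)
        g = gibbs_fill_in(h, unclamped, W2, d, e, rng)
        u = 1.0 - np.abs(v - sigmoid(W_gen @ g + b))      # eq. 15
        q = sigmoid((v * u) @ W_rec + c)                  # eq. 14
        w = (u @ rf_mask) / rf_pixels                     # fraction of each RF judged reliable
        h = q * w + g * (1.0 - w)                         # eqs. 16 and 17
    return h
```

The returned activation would then be fed through the remaining recognition weights for classification. Because `select_unclamped` builds a (hidden x Z) matrix, its cost grows with the product of the two layer sizes, which is one reason a larger first hidden layer makes denoising more expensive.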

5 We can interpret $u_i$ to be $p(v_i \neq \text{noise} \mid g)$.

Table 2. Summary of recognition results

Network        clean    border   block    random
28x28 DBN      1.03%    66.14%   33.78%   79.83%
7x7 sDBN       1.19%    2.46%    21.84%   65.50%
7x7+denoised   1.24%    1.29%    19.09%   3.83%
28x28+noise    1.68%    1.95%    8.72%    8.01%
7x7+noise      1.61%    1.77%    8.39%    6.64%

(a) Successful examples (b) Failed examples
Figure 8. Denoised results on various types of noise. The first column from the left contains the original images, and the second column contains images with noise added. Subsequent columns represent the denoised images from t = 1 to t = 6.

5. Discussion
It should be noted that the specific approach taken here does not depend on our adoption of the DBN. That is, if the network is not a DBN fine tuned by the up-down algorithm, $W^1_{gen}$ can be learned by maximum likelihood estimation. Consequently, this denoising algorithm can be easily adapted to any deep feedforward classifier as long as the first layer has spatially localized receptive fields.

There are several avenues for extending the present model. For one, human visual recognition of partially visible objects is more accurate if the occluding object can be identified (Fukushima, 2001; Johnson & Olshausen, 2005). In our experiments, when the block occluded region is known, denoising is much better. Compare the results of the block occlusion from figure 7 with those of the failed examples from figure 8. Accordingly, recognition error is reduced from 19% to 10% for the block occlusion noise test set. We hypothesize that the identification of the occluder is similar to specifying $\psi$. Therefore, an important avenue for future work is improving the estimation of the occluding object, or $\psi$.

Currently, denoising takes place on the hidden layer. It is also possible to denoise in the visible layer. Even though preliminary results of applying our denoising algorithm on the visible layer alone suggest that it is quite difficult, the combination of denoising on both the hidden and visible layers may give better results. Denoising at higher layers is also possible. However, because the RFs of the first hidden layer are chosen randomly, $H^1$ is not topographically ordered. It is certainly possible to organize $H^1$ to be topographical and enforce sparse connections to $H^2$, thereby making denoising $h^2$ effective.

6. Conclusions
In this paper, we have demonstrated that combining sparsification with explicit denoising results in a DBN that is much more robust to noise not in the training set than a standard DBN. We introduced an algorithm which is capable of denoising a test image by combining top-down influences with bottom-up inputs. Our denoising process does not model the noise process at all, but instead uses the log probability to estimate which nodes should be unclamped. It is able to handle a variety of noise and is inspired by findings in neurophysiology. Finally, the denoising itself can be adapted to a broad class of deep feedforward networks, making such an approach likely to be useful for other architectures not explored here.

Acknowledgements
We thank the anonymous reviewers for making this a much better manuscript. This research was supported by NSERC.

References
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Adv. in Neural Information Processing Systems 19, pp. 153-160, 2007.

Buffalo, E. A., Fries, P., Landman, R., Liang, H., and Desimone, R. A backward progression of attentional effects in the ventral stream. Proceedings of the National Academy of Sciences, 107(1):361-365, Jan. 2010.

Fukushima, K. Neocognitron: A neural model for a mechanism of visual pattern recognition. IEEE Trans. SMC, 13(5):826-834, 1983.

Fukushima, K. Recognition of partly occluded patterns: A neural network model. Biological Cybernetics, 84(4):251-259, 2001.

Haenny, P. E. and Schiller, P. H. State dependent activity in monkey visual cortex. I. Single cell activity in V1 and V4 on visual tasks. Experimental Brain Research, 69(2):225-244, 1988.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.

Hinton, G. E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Hubel, D. and Wiesel, T. Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology, 148:574-591, 1959.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In Proc. Intl. Conf. on Computer Vision (ICCV'09). IEEE, 2009.

Johnson, J. S. and Olshausen, B. A. The recognition of partially visible natural objects in the presence and absence of their occluders. Vision Research, 45(25-26):3262-3276, Nov. 2005.

Lamme, V. A. The neurophysiology of figure-ground segregation in primary visual cortex. The Journal of Neuroscience: the official journal of the Society for Neuroscience, 15:1605-1615, 1995.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov. 1998.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Intl. Conf. on Machine Learning, pp. 609-616, 2009.

Lee, T. S. and Mumford, D. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America, 20:1434-1448, 2003.

Martínez, A., Anllo-Vento, L., Sereno, M. I., Frank, L. R., Buxton, R. B., Dubowitz, D. J., Wong, E. C., Hinrichs, H., Heinze, H. J., and Hillyard, S. A. Involvement of striate and extrastriate visual cortical areas in spatial attention. Nature Neuroscience, 2(4):364-369, Apr. 1999.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11:125-139, 2001.

Norouzi, M., Ranjbar, M., and Mori, G. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2735-2742, 2009.

Posner, M. I. and Gilbert, C. D. Attention and primary visual cortex. Proc. of the National Academy of Sciences, 96(6), March 1999.

Roth, S. and Black, M. J. Fields of experts: A framework for learning image priors. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 860-867, 2005.

Salakhutdinov, R. and Hinton, G. Deep Boltzmann machines. In Proceedings of the Intl. Conf. on Artificial Intelligence and Statistics, volume 5, pp. 448-455, 2009.

Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the Intl. Conf. on Machine Learning, volume 25, 2008.

Serre, T., Wolf, L., and Poggio, T. Object recognition with features inspired by visual cortex. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 994-1000, 2005.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Intl. Conf. on Machine Learning, volume 307, pp. 1064-1071, 2008.

Treue, S. and Martínez Trujillo, J. C. Feature-based attention influences motion processing gain in macaque visual cortex. Nature, 399(6736):575-579, Jun. 1999.

Welling, M., Hinton, G. E., and Osindero, S. Learning sparse topographic representations with products of student-t distributions. In Adv. in Neural Information Processing Systems, pp. 1359-1366, 2002.