Restricted Boltzmann Machines are Hard to Approximately Evaluate or Simulate

Philip M. Long, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA (plong@google.com)
Rocco A. Servedio, Computer Science Department, Columbia University, New York, NY 10027 (rocco@cs.columbia.edu)

Abstract

Restricted Boltzmann Machines (RBMs) are a type of probability model over the Boolean cube $\{-1,1\}^n$ that have recently received much attention. We establish the intractability of two basic computational tasks involving RBMs, even if only a coarse approximation to the correct output is required. We first show that assuming P ≠ NP, for any fixed positive constant $K$ (which may be arbitrarily large) there is no polynomial-time algorithm for the following problem: given an $n$-bit input string $x$ and the parameters of an RBM $M$, output an estimate of the probability assigned to $x$ by $M$ that is accurate to within a multiplicative factor of $e^{Kn}$. This hardness result holds even if the parameters of $M$ are constrained to be at most $\psi(n)$ for any function $\psi(n)$ that grows faster than linearly, and if the number of hidden nodes of $M$ is at most $n$. We then show that assuming RP ≠ NP, there is no polynomial-time randomized algorithm for the following problem: given the parameters of an RBM $M$, generate a random example from a probability distribution whose total variation distance from the distribution defined by $M$ is at most 1/12.

1. Introduction

A Restricted Boltzmann Machine (Smolensky, 1987; Freund & Haussler, 1991; Hinton, 2002; Bengio, 2009) (henceforth simply denoted "RBM") with $m$ hidden nodes is defined by an $m \times n$ real matrix $A$ and two vectors $a \in \mathbb{R}^m$, $b \in \mathbb{R}^n$. These parameters $\theta = (A, a, b)$ define a probability distribution $\mathrm{RBM}_\theta$ over $x \in \{-1,1\}^n$ in the following way:

$$\mathrm{RBM}_\theta(x) \stackrel{\text{def}}{=} \frac{\sum_{h \in \{-1,1\}^m} e^{h^T a + h^T A x + b^T x}}{Z_\theta}, \tag{1}$$

where $Z_\theta$ is a normalizing factor (sometimes referred to as the "partition function") chosen so that $\mathrm{RBM}_\theta$ is a probability distribution, i.e.

$$Z_\theta \stackrel{\text{def}}{=} \sum_{h \in \{-1,1\}^m,\; z \in \{-1,1\}^n} e^{h^T a + h^T A z + b^T z}.$$

While RBMs were first introduced more than two decades ago (Smolensky, 1987; Freund & Haussler, 1991), they have recently been used as constituents of "deep belief network" learning systems (Hinton et al., 2006; Bengio, 2009). An approach to training deep networks that has been growing in popularity involves unsupervised training of RBMs as a subroutine. The success of this approach (see Hinton et al., 2006; Larochelle et al., 2007; Erhan et al.) motivates the subject of this paper, which is to study the complexity of basic computational tasks related to learning with RBMs.

Since RBMs are a way of modeling probability distributions over $\{-1,1\}^n$, the two most natural computational tasks regarding RBMs would seem to be the following:

1. Evaluating an RBM: Given a parameter vector $\theta$ and an input vector $x \in \{-1,1\}^n$, the task is to output the probability value $p = \mathrm{RBM}_\theta(x)$ that the distribution assigns to $x$. A potentially easier task is to approximately evaluate an RBM to within some multiplicative factor: given a parameter vector $\theta$, a vector $x \in \{-1,1\}^n$, and an approximation parameter $c > 1$, the task is to output a value $\hat{p}$ such that $\frac{1}{c} \cdot p \le \hat{p} \le c \cdot p$.

2. Simulating an RBM distribution: Given a parameter vector $\theta$ and an approximation parameter $\epsilon \in (0,1)$, the task is to output an efficiently evaluatable representation of a probability distribution $P$ over $\{-1,1\}^n$ such that the total variation distance between $P$ and $\mathrm{RBM}_\theta$ is at most $\epsilon$. (We formalize the notion of an "efficiently evaluatable representation of a distribution over $\{-1,1\}^n$" in the following standard way: such a representation consists of a Boolean circuit of poly($n$) size with $k = \mathrm{poly}(n)$ input bits and $n$ output bits. A draw from the distribution is obtained by setting the input bits to a uniform random $k$-bit string and reading the output string.)
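To make definition (1) concrete, here is a brute-force rendering in Python. This is a sketch only: `rbm_prob` is a hypothetical helper name (numpy assumed), and the double enumeration takes time exponential in $m$ and $n$, which is exactly why the evaluation task above is nontrivial.

```python
import itertools
import numpy as np

def rbm_prob(A, a, b, x):
    """Brute-force evaluation of RBM_theta(x) as in eq. (1).

    A: m-by-n real weight matrix; a: length-m hidden bias vector;
    b: length-n visible bias vector; x: a vector in {-1,+1}^n.
    Enumerates all hidden and visible configurations, so this is
    feasible only for very small models.
    """
    m, n = A.shape

    def unnormalized(z):
        # sum over h in {-1,1}^m of exp(h^T a + h^T A z + b^T z)
        return sum(
            np.exp(np.dot(h, a) + np.dot(h, A @ z) + np.dot(b, z))
            for h in (np.array(t) for t in itertools.product((-1, 1), repeat=m))
        )

    # partition function Z_theta: the same sum taken over all visible z
    Z = sum(unnormalized(np.array(z))
            for z in itertools.product((-1, 1), repeat=n))
    return unnormalized(np.asarray(x)) / Z
```

With $A$, $a$, $b$ all zero, this returns the uniform value $2^{-n}$, as expected from (1).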
This paper shows that each of these tasks is computationally hard in the worst case, even if only a coarse approximation to the correct output is required. As our first main result, we show that if P ≠ NP then the approximate evaluation task cannot be solved in polynomial time even with approximation parameter $c = e^{Kn}$, where $K > 0$ may be any fixed constant. (A precise statement of our hardness result, which is somewhat technical, is given as Theorem 8 in Section 4.) As our second main result, we show that if RP ≠ NP then there is no polynomial-time algorithm for the simulation task with approximation parameter $\epsilon = 1/12$. (See Theorem 13 in Section 5 for a precise statement.)

These results show strong worst-case limitations on evaluating and simulating general RBMs, but in many cases one may only be dealing with RBMs that have moderate-sized weights and relatively few hidden nodes. Thus it is of special interest to understand the complexity of approximately evaluating and simulating RBMs of this sort. We consider the approximate evaluation and simulation tasks when the RBM is restricted to have at most $n$ hidden nodes and each real parameter $A_{i,j}$, $a_i$, $b_j$ has magnitude at most $\psi(n)$. Our hardness results stated above hold for any bound $\psi(n)$ that grows faster than linearly, i.e. such that $\lim_{n \to \infty} \psi(n)/n = \infty$.

1.1. Related work.

Our results have the same high-level flavor as the work of several previous researchers who have studied the computational complexity of various tasks for different types of probability distribution models. In an early work of this sort, Abe and Warmuth (1992) showed that it is NP-hard to approximate the maximum likelihood model for probabilistic automata with a fixed number of states but a variable-sized alphabet. Kearns et al. (1994) showed that it is #P-hard to exactly evaluate probability models over $\{0,1\}^n$ in which each observed bit is a disjunction of three unknown hidden variables, each of which is assigned an unknown independent uniform random bit. Roth (1996) showed that it is NP-hard to approximate the probability that a given node in a multiply-connected Bayesian belief network is true. Bogdanov, Mossel and Vadhan (2008) studied the complexity of various computational tasks related to Markov Random Fields. Yasuda and Tanaka (2008) claim that training RBMs is NP-hard, but such a claim does not seem to address the approximate evaluation and approximate simulation tasks that we consider. To the best of our knowledge, ours is the first work establishing hardness of either evaluation or simulation (exact or approximate) for RBMs.

1.2. Our approach.

An important ingredient in our hardness results for both evaluation and simulation is a more technical result (Theorem 1 in Section 3) which shows that it is NP-hard to even coarsely approximate the value of the partition function $Z_\theta$.
We prove this by combining a binary search technique (Lemma 6) with a recent hardness result for approximating the pseudo-cut-norm of a matrix due to Alon and Naor (2004). Our hardness result for approximate evaluation, Theorem 8, follows rather straightforwardly from Theorem 1. For our simulation hardness result, we combine Theorem 1 with ideas from Jerrum, Valiant and Vazirani's proof (1986) that approximate uniform sampling from a set $S$ and approximate counting of the elements of $S$ are polynomial-time equivalent.

2. Background and Preliminaries

We drop the subscript $\theta$ from $Z_\theta$ when there is no possibility of confusion. We sometimes concentrate on models whose parameters $\theta = (A, a, b)$ have $a = 0$ and $b = 0$. In this case, we sometimes replace $\theta$ with $A$, for example writing $\mathrm{RBM}_A$ and $Z_A$. For $A$ a matrix with real entries, we write $\|A\|_\infty$ to denote $\max_{i,j} |A_{ij}|$. Similarly, we write $\|a\|_\infty$ to denote $\max_i |a_i|$ for a vector $a$.

For probability distributions $P$ and $Q$ over a finite set $S$, the total variation distance is $d_{TV}(P, Q) \stackrel{\text{def}}{=} \max_{E \subseteq S} |P(E) - Q(E)|$. Note that this is equivalent to $d_{TV}(P, Q) = \max_{E \subseteq S} \big(P(E) - Q(E)\big)$, since $P(S - E) - Q(S - E) = Q(E) - P(E)$.

A function $\psi$ is $\omega(n)$ if it grows faster than linearly, i.e., if $\lim_{n \to \infty} \psi(n)/n = \infty$.

3. Approximating the partition function is hard

The main result of this section is the following theorem, which says that it is hard to approximate the partition function $Z_A$:

Theorem 1 There is a universal constant $\gamma > 0$ such that if P ≠ NP, then there is no polynomial-time algorithm with the following property: given as input an $n \times n$ matrix $A$ satisfying $\|A\|_\infty \le \psi(n)$ (where the function $\psi$ satisfies $\psi(n) = \omega(n)$), the algorithm approximates the partition function $Z_A$ to within a multiplicative factor of $e^{\gamma \psi(n)}$.

Our proof uses a reduction from the problem of approximating a norm defined by Alon and Naor (2004). We refer to this norm, which is denoted by $\|A\|_{\infty \to 1}$ in (Alon & Naor, 2004), as the pseudo-cut-norm of $A$ and denote it simply by $\|A\|$.

Definition 2 The pseudo-cut-norm of an $m \times n$ real matrix $A$ is defined by
$$\|A\| \stackrel{\text{def}}{=} \max_{h \in \{-1,1\}^m,\; x \in \{-1,1\}^n} h^T A x.$$
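Definition 2 can be rendered directly in code. The sketch below (hypothetical helper name, numpy assumed) uses the observation that for a fixed $x$ the maximizing $h$ simply matches the signs of $Ax$, so $\max_h h^T A x = \|Ax\|_1$; this trims the brute-force search to the $2^n$ choices of $x$, though it of course remains exponential, in line with Theorem 3 below.

```python
import itertools
import numpy as np

def pseudo_cut_norm(A):
    """Brute-force pseudo-cut-norm: max over h, x in {-1,1} of h^T A x."""
    m, n = A.shape
    # For fixed x, the best h sets h_i = sign((Ax)_i), giving ||Ax||_1.
    return max(np.abs(A @ np.array(x)).sum()
               for x in itertools.product((-1, 1), repeat=n))
```

On small random matrices this lets one sanity-check the bounds of Lemma 5 below: `np.abs(A).max() <= pseudo_cut_norm(A) <= A.size * np.abs(A).max()`.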
Theorem 3 ((Alon & Naor, 2004)) There is a universal constant $\epsilon_0 > 0$ such that, if P ≠ NP, then there is no polynomial-time algorithm that approximates the pseudo-cut-norm to within a factor of $1 + \epsilon_0$.

The reduction in the proof of Theorem 3 in (Alon & Naor, 2004) uses non-square matrices, but an easy corollary extends this hardness result to square matrices (we give the simple proof in Appendix A):

Corollary 4 Theorem 3 holds even if the matrix is constrained to be square.

We will need the following upper and lower bounds on the pseudo-cut-norm of $A$:

Lemma 5 For an $m \times n$ matrix $A$, we have $\|A\|_\infty \le \|A\| \le mn\|A\|_\infty$.

Proof: Alon and Naor (2004) note that the pseudo-cut-norm satisfies
$$\|A\| \ge \max_{u \in \{0,1\}^m,\; v \in \{0,1\}^n} u^T A v$$
(the right-hand side is the actual "cut-norm" of $A$). Since $A_{ij}$ equals $e_i^T A e_j$ (where $e_i$ has a 1 in the $i$-th coordinate and 0's elsewhere), and since $\|A\| = \|{-A}\|$ (negate $h$), we get $\|A\| \ge \max_{i,j} |A_{ij}| = \|A\|_\infty$. It remains only to observe that for any $h \in \{-1,1\}^m$, $x \in \{-1,1\}^n$, we have $h^T A x = \sum_{i,j} h_i x_j A_{ij} \le mn\|A\|_\infty$. □

3.1. Proof of Theorem 1

Throughout this section $\psi$ denotes a function $\psi(n) = \omega(n)$ as in Theorem 1. We first show that it is hard to distinguish between matrices with "large" versus "slightly less large" pseudo-cut-norm:

Lemma 6 There is a universal constant $\epsilon > 0$ such that if P ≠ NP, then there is no polynomial-time algorithm to solve the following promise problem:

Input: An $n$-by-$n$ matrix $A$ such that $\|A\|_\infty \le \psi(n)$ and either (i) $\|A\| > \psi(n)$; or (ii) $\|A\| \le (1 - \epsilon)\psi(n)$.

Output: Answer whether (i) or (ii) holds.

Proof: The proof is by contradiction; so suppose that for every $\epsilon > 0$, ALG$_\epsilon$ is a polynomial-time algorithm that solves the promise problem with parameter $\epsilon$. We will show that there is a polynomial-time algorithm ALG′ (which performs a rough binary search using ALG$_\epsilon$ as a subroutine) that can approximate the pseudo-cut-norm $\|B\|$ of any $n \times n$ input matrix $B$ to within a multiplicative factor of $\left(\frac{1}{1-\epsilon}\right)^2$, a contradiction with Corollary 4.

So let $B$ be any $n \times n$ input matrix. Since $\|\lambda B\| = \lambda\|B\|$ for any $\lambda > 0$, we may rescale $B$ as a preprocessing step, so we assume without loss of generality that $B$ has $\|B\|_\infty = 1$. Now fix any $c > 1$ and consider an execution of ALG$_\epsilon$ on the matrix $A = (\psi(n)/c)B$. If ALG$_\epsilon$ returns "(i)" then (ii) does not hold, so $\|A\| > (1-\epsilon)\psi(n)$, which implies that $\|B\| > (1-\epsilon)c$. Similarly, if ALG$_\epsilon$ returns "(ii)" then $B$ must have $\|B\| \le c$.

The algorithm ALG′ maintains an interval $[\ell, u]$ of possible values for $\log \|B\|$, which it successively prunes using ALG$_\epsilon$ to do a rough binary search. Using $\|B\|_\infty = 1$ and Lemma 5, initially we may take $[\ell, u] = [0, 2\ln n]$ to be an interval of length $r_0 = 2\ln n$. After the $t$-th stage of binary search using ALG$_\epsilon$, the new length $r_t$ of the interval is related to the old length $r_{t-1}$ by $r_t \le r_{t-1}/2 + \log(1/(1-\epsilon))$. As long as $r_{t-1}$ is at least $4\log(1/(1-\epsilon))$, this implies $r_t \le 3r_{t-1}/4$. So after $O(\log\log n)$ iterations of binary search, ALG′ narrows the initial interval $[0, 2\ln n]$ to an interval $[\ell, u]$ of width at most $4\log(1/(1-\epsilon))$. (We note that each execution of ALG$_\epsilon$ in the binary search indeed uses a value $c$ which is at least 1 as required.) Algorithm ALG′ outputs $e^{(u+\ell)/2}$ as its estimate of $\|B\|$. Since $u - \ell \le 4\log(1/(1-\epsilon))$ implies that $e^{u-\ell} \le \left(\frac{1}{1-\epsilon}\right)^4$, the estimate $e^{(u+\ell)/2}$ is accurate for $\|B\|$ to within a multiplicative approximation factor of $\left(\frac{1}{1-\epsilon}\right)^2$. As noted at the start of the proof, since $\epsilon$ could be any constant greater than 0, this contradicts the hardness of approximating $\|B\|$ (Theorem 3). □

An easy consequence of Lemma 6 is that it is hard to distinguish between RBMs whose partition functions are "large" versus "much less large":

Lemma 7 There is a universal constant $\epsilon > 0$ such that if P ≠ NP, then there is no polynomial-time algorithm to solve the following promise problem:

Input: An $n$-by-$n$ matrix $A$ such that $\|A\|_\infty \le \psi(n)$ and either (i) $Z_A > \exp(\psi(n))$; or (ii) $Z_A \le 4^n \exp((1-\epsilon)\psi(n))$.

Output: Answer whether (i) or (ii) holds.

Proof: By Lemma 6, if an $n$-by-$n$ matrix $A$ satisfies $\|A\|_\infty \le \psi(n)$ and either (a) $\max_{h,x} h^T A x > \psi(n)$ holds or (b) $\max_{h,x} h^T A x \le (1-\epsilon)\psi(n)$ holds, it is hard to determine whether (a) or (b) holds. It is clear that (a) implies (i) and that (b) implies (ii). Since $\psi(n) = \omega(n)$, for all but finitely many $n$ we have that the two alternatives (i) and (ii) are mutually exclusive. So for all sufficiently large $n$, an algorithm to determine whether (i) or (ii) holds could directly be used to determine whether (a) or (b) holds. □

Proof of Theorem 1: Let $\epsilon > 0$ be the constant from Lemma 7. Let $U = \exp(\psi(n))$ and $L = 4^n \exp((1-\epsilon)\psi(n))$. An algorithm that can approximate $Z$ to within a multiplicative factor of $\sqrt{U/L}$ can distinguish $Z \ge U$ from $Z < L$. We have
$$\sqrt{U/L} = \exp\!\left(\frac{\epsilon\psi(n) - n\ln 4}{2}\right),$$
so an approximation factor better than this would contradict Lemma 7; since $\psi(n) = \omega(n)$, any $\gamma < \epsilon/2$ suffices in Theorem 1. □
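The rough binary search in the proof of Lemma 6 can be summarized in code. This is a schematic sketch only: `promise_solver` is a hypothetical stand-in for ALG$_\epsilon$ run on $A = (\psi(n)/c)B$ after $B$ has been rescaled so that $\|B\|_\infty = 1$.

```python
import math

def estimate_pseudo_cut_norm(promise_solver, eps, n):
    """Rough binary search from the proof of Lemma 6 (schematic).

    promise_solver(c) is assumed to answer "(i)" or "(ii)" for the matrix
    A = (psi(n)/c) * B, so that "(i)" certifies ||B|| > (1-eps)*c and
    "(ii)" certifies ||B|| <= c.  Returns an estimate of ||B|| accurate
    to within a multiplicative factor of (1/(1-eps))**2.
    """
    lo, hi = 0.0, 2.0 * math.log(n)       # log||B|| lies in [0, 2 ln n] by Lemma 5
    slack = math.log(1.0 / (1.0 - eps))
    while hi - lo > 4.0 * slack:          # O(log log n) iterations
        c = math.exp((lo + hi) / 2.0)
        if promise_solver(c) == "(i)":
            lo = math.log(c) - slack      # ||B|| > (1-eps)*c
        else:
            hi = math.log(c)              # ||B|| <= c
    return math.exp((lo + hi) / 2.0)
```

Each pass either halves the interval or halves it up to the additive `slack` term, matching the recurrence $r_t \le r_{t-1}/2 + \log(1/(1-\epsilon))$ in the proof.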
4. Approximate Evaluation is Hard

In this section we show that it is hard to approximately evaluate a given RBM $A$ on a given input string $x$:

Theorem 8 There is a universal constant $\gamma > 0$ such that if P ≠ NP, then there is no polynomial-time algorithm with the following property: given as input an $n \times n$ matrix $A$ satisfying $\|A\|_\infty \le \psi(n)$ (where the function $\psi$ satisfies $\psi(n) = \omega(n)$) and an input string $x \in \{-1,1\}^n$, the algorithm approximates the probability $\mathrm{RBM}_A(x)$ to within a multiplicative factor of $e^{\gamma\psi(n)}$.

Note that since $\psi(n) = \omega(n)$, the above result implies that approximating $\mathrm{RBM}_A(x)$ to the smaller multiplicative factor $e^{Kn}$ is also hard, where $K$ may be any positive constant.

We will use the fact that the numerator of the expression (1) for $\mathrm{RBM}_A(x)$ can be computed efficiently, which is known and not difficult to show. (See e.g. (5.12) of (Bengio, 2009) for an explicit proof.)

Lemma 9 There is a poly($n$)-time algorithm that, given $A$ and $x$, computes $\sum_{h \in \{-1,1\}^n} \exp(h^T A x)$.

Now we are ready for the proof that evaluation is hard.

Proof of Theorem 8: We actually show something stronger: if P ≠ NP then there is no polynomial-time algorithm which, given an RBM $A$ as described, can output any pair $(x, \hat{p})$ where $\hat{p}$ is a multiplicative $e^{\gamma\psi(n)}$-approximation to $\mathrm{RBM}_A(x)$. (In other words, not only is it hard to approximate $\mathrm{RBM}_A(x)$ for a worst-case pair $(A, x)$, but for a worst-case $A$ it is hard to approximate $\mathrm{RBM}_A(x)$ for any $x$.)

To see this, note that since Lemma 9 implies that we can efficiently exactly evaluate the numerator of the expression (1) for $\mathrm{RBM}_A(x)$, approximating $\mathrm{RBM}_A(x)$ to within a given multiplicative factor is equivalent to approximating $1/Z_A$ to within the same factor. But since $f(u) = 1/u$ is monotone in $u$, for an estimate $\hat{Z}$ of $Z$ and a desired approximation factor $c$, we of course have that
$$\frac{1}{c} \cdot \frac{1}{Z} \;\le\; \frac{1}{\hat{Z}} \;\le\; c \cdot \frac{1}{Z} \quad\text{if and only if}\quad \hat{Z}/c \;\le\; Z \;\le\; c\hat{Z},$$
so we are done by Theorem 1. □
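The factorization behind Lemma 9 is the standard one: the sum over $h \in \{-1,1\}^m$ splits coordinate-wise into $e^{b^T x} \prod_i (e^{t_i} + e^{-t_i}) = e^{b^T x} \prod_i 2\cosh(t_i)$ where $t = a + Ax$. A sketch (hypothetical helper name, numpy assumed; computed in log space, since for weights of size $\psi(n)$ the value itself can overflow):

```python
import numpy as np

def log_rbm_numerator(A, a, b, x):
    """log of f_theta(x) = sum_h exp(h^T a + h^T A x + b^T x), in poly time.

    Uses the coordinate-wise factorization into prod_i 2*cosh(a_i + (Ax)_i).
    """
    x = np.asarray(x, dtype=float)
    t = a + A @ x
    # log(2*cosh(t)) = |t| + log(1 + exp(-2|t|)), numerically stable
    log_two_cosh = np.abs(t) + np.log1p(np.exp(-2.0 * np.abs(t)))
    return float(b @ x + log_two_cosh.sum())
```

For tiny models, exponentiating this and dividing by a brute-force partition function reproduces eq. (1) exactly.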
5. Approximate Simulation is Hard

In this section we establish the hardness of constructing an efficiently evaluatable representation of any distribution $P$ that is close to $\mathrm{RBM}_\theta$, where $\theta = (A, a, b)$ is a given set of RBM parameters. Our proof is inspired by Jerrum, Valiant and Vazirani's proof that approximate counting reduces to approximate uniform sampling (Jerrum et al., 1986). To aid in explaining our proof, in the following paragraph we briefly recall the idea of the (Jerrum et al., 1986) reduction from approximate counting to approximate uniform sampling, and then explain the connection to our scenario.

Let $S \subseteq \{0,1\}^n$ be a set whose elements we would like to approximately count, i.e. our goal is to approximate $|S|$. Suppose that ALG is an algorithm that can approximately sample a uniform element from $S$, i.e. each element of $S$ is returned by ALG with some probability in the range $\left[\frac{1}{1+\epsilon} \cdot \frac{1}{|S|},\; (1+\epsilon) \cdot \frac{1}{|S|}\right]$. The reduction proceeds for $n$ stages. In the first stage, by drawing samples from $S$ using ALG, it is possible to approximately estimate $|S_0|$ and $|S_1|$, where $S_b$ is $\{x \in S : x_1 = b\}$, so that the larger one is estimated to within a multiplicative factor of $(1 + 2\epsilon)$. Let $b_1 \in \{0,1\}$ be the bit such that $\widehat{|S_{b_1}|} \ge |S|/2$, where $\widehat{|S_{b_1}|}$ is the estimated size of $S_{b_1}$, and let $\hat{p}_{1,b_1} \ge 1/2$ denote the corresponding estimate of $|S_{b_1}|/|S|$. In the second stage we repeat this process, using ALG on $S_{b_1}$, to obtain values $b_2$, $\widehat{|S_{b_1,b_2}|}$ and $\hat{p}_{2,b_2}$, where $\widehat{|S_{b_1,b_2}|}$ is the estimated size of $S_{b_1,b_2} = \{x \in S : x_1 = b_1 \text{ and } x_2 = b_2\}$. By continuing in this way for $n$ stages, we reach a set $S_{b_1,\ldots,b_n}$ of size 1. The final estimate of $|S|$ is $1/\prod_{i=1}^n \hat{p}_{i,b_i}$, and it can be shown that this is an accurate approximator of $|S|$ to within a multiplicative factor of $(1 + 2\epsilon)^n$.

In our setting the ability to approximately simulate a given RBM distribution plays the role of the ability to approximately sample from sets like $S_{b_1,\ldots,b_i}$. Approximate counting of $|S|$ corresponds to approximately computing $\mathrm{RBM}_\theta(x)$ for a particular string $x$ (which corresponds to the bitstring $b_1 \ldots b_n$). Since Theorem 1 implies that approximating $\mathrm{RBM}_\theta(x)$ is hard, it must be the case that approximately simulating a given RBM distribution is also hard.

5.1. Preliminaries

As the above proof sketch suggests, we will need to build RBM models for various distributions obtained by conditioning on fixed values of some of the observed variables.

Definition 10 Let $P$ be a probability distribution over $\{-1,1\}^n$, let $i_1, \ldots, i_k$ be a list of distinct variable indices from $[n]$, and let $x_{i_1}, \ldots, x_{i_k}$ be values for those variables. We write $\mathrm{cond}(P;\, X_{i_1} = x_{i_1}, \ldots, X_{i_k} = x_{i_k})$ to denote the distribution obtained by drawing a random variable $(X_1, \ldots, X_n)$ distributed according to $P$ and conditioning on the event that $X_{i_1} = x_{i_1}, \ldots, X_{i_k} = x_{i_k}$.

It will be helpful to have notation for the numerator in the formula for the probability assigned by an RBM.

Definition 11 For an RBM $\theta = (A, a, b)$ with $m$ hidden nodes and a string $x \in \{-1,1\}^n$, define the energy of $x$ w.r.t. $\theta$, denoted by $f_\theta(x)$, to be $\sum_{h \in \{-1,1\}^m} \exp(h^T a + h^T A x + b^T x)$.

5.2. Building RBM Models for Conditional Distributions

In the following lemmas, for parameters $\theta = (A, a, b)$ we write $\|\theta\|_\infty$ to denote $\max\{\|A\|_\infty, \|a\|_\infty, \|b\|_\infty\}$.

Lemma 12 There is a poly($n$, $1/\epsilon$, $\|\theta\|_\infty$)-time algorithm with the following properties: The algorithm is given any $\epsilon > 0$, any parameters $\theta = (A, a, b)$ where $A$ is an $n \times n$ matrix (and all the parameters of $A$, $a$, $b$ have poly($n$) bits of precision), and any values $u_1, \ldots, u_k \in \{-1,1\}$ for observed variables $X_1, \ldots, X_k$. The algorithm outputs a parameterization $\theta' = (A', a', b')$ of an RBM such that
$$d_{TV}\big(\mathrm{RBM}_{\theta'},\ \mathrm{cond}(\mathrm{RBM}_\theta;\, X_1 = u_1, \ldots, X_k = u_k)\big) \le \epsilon.$$
Moreover, the matrix $A'$ is $(n+k) \times n$, the parameterization satisfies $\|\theta'\|_\infty \le \mathrm{poly}(\|\theta\|_\infty, n, \log 1/\epsilon)$, and all the parameters $A'$, $a'$, $b'$ have poly($n$) + $O(\log\log(1/\epsilon))$ bits of precision.

Proof: We will start by describing how the algorithm handles the case in which $u_1 = \cdots = u_k = 1$, and then describe how to handle the general case at the end of the proof. The RBM parameterized by $\theta'$ is obtained from $\theta$ by

· adding $k$ extra hidden nodes;
· adding $k$ rows to $A$, making $A'_{n+j,j}$ a large positive value $M$ (we will show that an integer value that is not too large will suffice), and setting the other components of the $(n+j)$-th row to 0;
· adding $k$ components to $a$ that are all the same large value $M$. The vector $b'$ equals $b$.

Roughly, the idea is as follows. Each extra hidden node clamps the value of one variable. The role of $a'_{n+j}$ is to clamp the value of the $j$-th extra hidden variable to 1. The role of $A'_{n+j,j}$ is to force $X_j$ to be equal to the value of the $j$-th extra hidden variable, and therefore equal to 1. Let $A'$, $a'$ and $b'$ be these parameters.

Breaking up the vector of hidden variables into its new components $g \in \{-1,1\}^k$ and old components $h \in \{-1,1\}^n$, and applying the definitions of $A'$, $a'$ and $b'$, we have
$$f_{\theta'}(x_1, \ldots, x_n) = \sum_{h' \in \{-1,1\}^{n+k}} \exp\big((h')^T a' + (h')^T A' x + (b')^T x\big) = \sum_{h \in \{-1,1\}^n} \sum_{g \in \{-1,1\}^k} \exp\Big(h^T a + h^T A x + b^T x + \sum_{j=1}^k (M x_j g_j + M g_j)\Big),$$
which immediately gives
$$f_{\theta'}(x_1, \ldots, x_n) = \Bigg(\sum_{g \in \{-1,1\}^k} \exp\Big(\sum_{j=1}^k (M x_j g_j + M g_j)\Big)\Bigg) \times f_\theta(x_1, \ldots, x_n).$$

Let $F$ be the event that $X_1 = 1, \ldots, X_k = 1$ and let
$$\beta(x) = \sum_{g \in \{-1,1\}^k} \exp\Big(\sum_{j=1}^k (M x_j g_j + M g_j)\Big).$$
Note that $\beta(x)$ is maximized by any $x \in F$. Let $\beta(F)$ be this value. Note that $\beta(F) \ge e^{2kM}$, and $\beta(x) \le 2^k e^{(2k-2)M}$ otherwise, so that
$$\forall x \notin F,\quad \beta(x) \le 2^k e^{-2M} \beta(F). \tag{2}$$
Now fix any event $E$, and let $X$ denote $\{-1,1\}^n$. We have
$$\mathrm{RBM}_{\theta'}(E) \le \mathrm{RBM}_{\theta'}(E \mid F) + \mathrm{RBM}_{\theta'}(\neg F). \tag{3}$$
First,
$$\mathrm{RBM}_{\theta'}(E \mid F) = \frac{\sum_{x \in E \cap F} \beta(F) f_\theta(x)}{\sum_{x \in F} \beta(F) f_\theta(x)} = \mathrm{RBM}_\theta(E \mid F). \tag{4}$$

For any $x \notin F$, let the "correction" of $x$, called $c(x)$, be the member of $F$ obtained by setting the first $k$ components of $x$ to 1; note that $c(x) \in F$ and that each $x \in F$ has exactly $|c^{-1}(\{x\})| = 2^k - 1$ preimages outside $F$. We have
$$\mathrm{RBM}_{\theta'}(\neg F) = \frac{\sum_{x \notin F} \beta(x) f_\theta(x)}{\sum_{x \in X} \beta(x) f_\theta(x)} \le \frac{\sum_{x \notin F} \beta(x) f_\theta(x)}{\sum_{x \in F} \beta(F) f_\theta(x)} = \frac{(2^k - 1) \sum_{x \notin F} \beta(x) f_\theta(x)}{(2^k - 1) \sum_{x \in F} \beta(F) f_\theta(x)}.$$
Since $|c^{-1}(\{x\})| = 2^k - 1$ for each $x \in F$, the denominator equals $\sum_{x \notin F} \beta(F) f_\theta(c(x))$, and so by (2),
$$\mathrm{RBM}_{\theta'}(\neg F) \le \frac{(2^k - 1)\, 2^k e^{-2M} \beta(F) \sum_{x \notin F} f_\theta(x)}{\beta(F) \sum_{x \notin F} f_\theta(c(x))} \le 4^k e^{-2M} \, \frac{\sum_{x \notin F} f_\theta(x)}{\sum_{x \notin F} f_\theta(c(x))}. \tag{5}$$
Finally, since $c(x)$ differs from $x$ in at most $k$ components, we have that for all $x \notin F$, $f_\theta(x) \le e^{2k(n+1)\|\theta\|_\infty} f_\theta(c(x))$. Thus (5) implies that
$$\mathrm{RBM}_{\theta'}(\neg F) \le 4^k e^{2k(n+1)\|\theta\|_\infty} e^{-2M},$$
and this, together with (3) and (4), implies
$$\mathrm{RBM}_{\theta'}(E) \le \mathrm{cond}(\mathrm{RBM}_\theta;\, F)(E) + 4^k e^{2k(n+1)\|\theta\|_\infty} e^{-2M}. \tag{6}$$
This implies that $\mathrm{RBM}_{\theta'}(E) \le \mathrm{cond}(\mathrm{RBM}_\theta;\, F)(E) + \epsilon$ provided that
$$M \ge \tfrac{1}{2}\ln(1/\epsilon) + k \ln 2 + k(n+1)\|\theta\|_\infty.$$
Since the event $E$ was arbitrary, recalling the definition of total variation distance, this completes the proof in the case $u_1 = \cdots = u_k = 1$. The general case can be treated with a nearly identical analysis, after setting $A'_{n+j,j}$ to be $u_j M$ for each $j = 1, \ldots, k$. □
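The construction in the proof of Lemma 12 is itself a small algorithm, and a direct transcription may help. This is a sketch (hypothetical helper name, numpy assumed), written for the general case with $A'_{n+j,j} = u_j M$ and $M$ chosen according to the bound just derived:

```python
import numpy as np

def clamp_rbm(A, a, b, u, eps):
    """Lemma 12 construction: an RBM within TV distance eps of RBM_theta
    conditioned on X_1 = u[0], ..., X_k = u[k-1].

    Appends one hidden node per clamped variable: the bias M pins the new
    hidden unit to +1, and the weight u_j * M then forces X_j towards u_j.
    """
    n = A.shape[1]
    k = len(u)
    theta_inf = max(np.abs(A).max(), np.abs(a).max(), np.abs(b).max())
    # Bound from the proof: M >= (1/2) ln(1/eps) + k ln 2 + k(n+1)||theta||_inf
    M = 0.5 * np.log(1.0 / eps) + k * np.log(2.0) + k * (n + 1) * theta_inf

    extra_rows = np.zeros((k, n))
    for j in range(k):
        extra_rows[j, j] = u[j] * M     # row n+j: u_j * M in column j, 0 elsewhere
    A_new = np.vstack([A, extra_rows])  # now (n+k)-by-n
    a_new = np.concatenate([a, np.full(k, M)])
    return A_new, a_new, b.copy()
```

For very small models one can check the output against a brute-force computation of the conditional distribution.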
5.3. Approximate Simulation is Hard

Theorem 13 If RP ≠ NP, then there is no polynomial-time algorithm with the following property: given as input $\theta = (A, a, b)$ such that $A$ is an $n \times n$ matrix and $\|\theta\|_\infty \le \psi(n)$ (where $\psi(n) = \omega(n)$), the algorithm outputs an efficiently evaluatable representation of a distribution whose total variation distance from $\mathrm{RBM}_\theta$ is at most $\epsilon = 1/12$.

Proof: The proof is by contradiction; here is a high-level outline. We will suppose that OUTPUT-DIST is a polynomial-time algorithm that, on input $\theta$, constructs an $\epsilon$-close distribution to $\mathrm{RBM}_\theta$. We will show that there is a randomized algorithm which, using OUTPUT-DIST, with high probability (at least 9/10) can efficiently find a particular $x \in \{-1,1\}^n$ for which the value of $\mathrm{RBM}_\theta(x)$ can be estimated to within a multiplicative factor of $2^n$. Since $\mathrm{RBM}_\theta(x) = f_\theta(x)/Z_\theta$, and we can efficiently exactly compute $f_\theta(x)$ (see Lemma 9), this will imply that we can approximate $Z_\theta$ to within a factor $2^n$. Since $2^n$ is less than $e^{\gamma\psi(n)}$ for all sufficiently large $n$, where $\gamma$ is the constant from Theorem 1, this contradicts Theorem 1 unless RP = NP.

The randomized algorithm creates $x = (x_1, \ldots, x_n)$ as follows. Fix $\delta = 1/10$. Set $\theta_1 = \theta$, and, for each $i \in [n]$ in turn,

· Run OUTPUT-DIST to obtain an efficiently evaluatable representation of a distribution $P_i$ such that $d_{TV}(P_i, \mathrm{RBM}_{\theta_i}) \le \epsilon$;
· Sample $(4/\epsilon^2)\log(n/\delta)$ times from $P_i$ (note that this can be done efficiently);
· Set $x_i$ to be the bit that appears most often for the $i$-th component $X_i$ in the sampled strings;
· Use Lemma 12 to construct a set of RBM parameters $\theta_{i+1}$ such that $d_{TV}(\mathrm{RBM}_{\theta_{i+1}}, \mathrm{cond}(\mathrm{RBM}_\theta;\, X_1 = x_1, \ldots, X_i = x_i)) \le \epsilon$.

For each $i \in [n]$ let $\hat{p}_i$ denote the fraction of times that $X_i = x_i$ over the samples drawn during the $i$-th stage. The Chernoff-Hoeffding bound implies that, with probability $1 - \delta$, for all $i \in [n]$ we have $|\hat{p}_i - P_i[X_i = x_i]| \le \epsilon$ (where the notation $D[E]$ denotes the probability of event $E$ under distribution $D$). Since $d_{TV}(P_i, \mathrm{RBM}_{\theta_i}) \le \epsilon$, this implies that $|\hat{p}_i - \mathrm{RBM}_{\theta_i}[X_i = x_i]| \le 2\epsilon$, and the fact that $d_{TV}(\mathrm{RBM}_{\theta_i}, \mathrm{cond}(\mathrm{RBM}_\theta;\, X_1 = x_1, \ldots, X_{i-1} = x_{i-1})) \le \epsilon$ implies that
$$|\hat{p}_i - \mathrm{RBM}_\theta[X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}]| \le 3\epsilon.$$
Since $\epsilon = 1/12$ and $\hat{p}_i \ge 1/2$, this implies that $\mathrm{RBM}_\theta[X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}] \ge 1/4$ and therefore that
$$\frac{1}{2} \;\le\; \frac{\hat{p}_i}{\mathrm{RBM}_\theta[X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}]} \;\le\; 2. \tag{7}$$
Let us compute an estimate $\hat{p}$ of $\mathrm{RBM}_\theta(x)$ by setting $\hat{p} = \prod_i \hat{p}_i$. We can use the probability chain rule together with (7) to evaluate the accuracy of this estimate as follows:
$$\frac{\hat{p}}{\mathrm{RBM}_\theta(x)} = \prod_i \frac{\hat{p}_i}{\mathrm{RBM}_\theta(x_i \mid x_1 \cdots x_{i-1})} \in \left[\frac{1}{2^n},\; 2^n\right].$$
As mentioned above, since $\mathrm{RBM}_\theta(x) = f_\theta(x)/Z_\theta$ and we can exactly compute $f_\theta(x)$ in polynomial time, this implies that we can estimate $1/Z_\theta$ to within a $2^n$ factor, and, as noted in the proof of Theorem 8, this implies that we can estimate $Z_\theta$ to within a $2^n$ factor. This contradicts Theorem 1 and completes the proof. □
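Putting the pieces together, the reduction can be summarized in schematic form. Both callbacks below are hypothetical stand-ins: `sample_from` plays OUTPUT-DIST followed by a single draw, and `clamp` plays the Lemma 12 construction.

```python
import math

def estimate_rbm_probability(sample_from, clamp, theta, n, eps=1/12, delta=1/10):
    """Schematic of the Theorem 13 reduction.

    sample_from(theta_i) is assumed to draw a string in {-1,1}^n from a
    distribution within total variation distance eps of RBM_{theta_i};
    clamp(theta, xs) is assumed to return parameters of an RBM close to
    RBM_theta conditioned on X_1 = xs[0], ..., X_i = xs[i-1] (Lemma 12).
    Returns a string x and an estimate of RBM_theta(x) accurate to
    within a factor 2^n with high probability.
    """
    reps = int((4.0 / eps**2) * math.log(n / delta)) + 1
    xs, p_hats = [], []
    theta_i = theta
    for i in range(n):
        draws = [sample_from(theta_i)[i] for _ in range(reps)]
        x_i = 1 if draws.count(1) >= reps / 2.0 else -1   # majority bit
        p_hats.append(draws.count(x_i) / reps)            # \hat p_i >= 1/2
        xs.append(x_i)
        theta_i = clamp(theta, xs)     # condition on the bits chosen so far
    return xs, math.prod(p_hats)       # \hat p = prod_i \hat p_i
```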
6. Discussion

We have established strong worst-case computational hardness results for the basic tasks of evaluating and simulating RBMs. We view these hardness results as providing additional motivation to obtain a comprehensive theoretical understanding of why and how RBMs perform well in practice. One possibility is that the parameters of real-world RBMs tend to be even smaller than the bounds satisfied by the constructions used to establish our results; the recent analysis of (Bengio & Delalleau, 2009) seems to conform with this possibility. Further study of the theoretical benefits of different properties of RBM models, and of algorithms that promote those properties, may lead to improvements in the practical state of the art of learning using these models.

References

Abe, N. and Warmuth, M. K. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9(2–3):205–260, 1992.

Alon, N. and Naor, A. Approximating the cut-norm via Grothendieck's inequality. In STOC, pp. 72–80, 2004.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Bengio, Y. and Delalleau, O. Justifying and generalizing contrastive divergence. Neural Computation, 21(6):1601–1621, 2009.

Bogdanov, A., Mossel, E., and Vadhan, S. P. The complexity of distinguishing Markov random fields. In APPROX-RANDOM, pp. 331–342, 2008.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research. To appear.

Freund, Y. and Haussler, D. Unsupervised learning of distributions of binary vectors using 2-layer networks. In NIPS, pp. 912–919, 1991.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Jerrum, M., Valiant, L. G., and Vazirani, V. V. Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci., 43:169–188, 1986.

Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R., and Sellie, L. On the learnability of discrete distributions. In Proc. of Twenty-sixth ACM Symposium on Theory of Computing, pp. 273–282, 1994.

Larochelle, H., Erhan, D., Courville, A. C., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pp. 473–480, 2007.

Roth, D. On the hardness of approximate reasoning. Artificial Intelligence, 82(1–2):273–302, 1996.

Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D. E., McClelland, J. L., et al. (eds.), Parallel Distributed Processing: Volume 1: Foundations, pp. 194–281. MIT Press, Cambridge, 1987.

Yasuda, M. and Tanaka, K. Approximate learning algorithm for restricted Boltzmann machines. In CIMCA/IAWTIC/ISE, pp. 692–697, 2008.

A. Proof of Corollary 4

Let $A$ be a non-square $m \times n$ matrix; we suppose $m > n$ (the other case is entirely similar). We may pad $A$ with $m - n$ all-zero columns on the right, to form an $m \times m$ matrix $B$ which is at most quadratically larger than $A$. We claim that the pseudo-cut-norm of $B$ is the same as the pseudo-cut-norm of $A$: for any $h, u \in \{-1,1\}^m$, if we form $x$ out of the first $n$ components of $u$, then, since the last $m - n$ columns of $B$ are all zeroes, $h^T B u$ does not depend on the last $m - n$ components of $u$, so $h^T B u = h^T A x$. Since any $h$ and $x$ may be formed this way, we have
$$\|A\| = \max_{h,x} h^T A x = \max_{h,u} h^T B u = \|B\|.$$
Thus, a polynomial-time algorithm for approximating the pseudo-cut-norm of square matrices (like $B$) to within an arbitrary constant factor would yield a corresponding polynomial-time algorithm for approximating the pseudo-cut-norm of matrices (like $A$) for which $m > n$.
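The padding step in the proof above is a one-liner; a sketch (hypothetical helper name, numpy assumed):

```python
import numpy as np

def pad_to_square(A):
    """Appendix A padding: append all-zero columns so that an m-by-n
    matrix (m > n) becomes m-by-m with the same pseudo-cut-norm."""
    m, n = A.shape
    assert m > n, "the other case is symmetric"
    return np.hstack([A, np.zeros((m, m - n))])
```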