Polynomial regression under arbitrary product distributions

Eric Blais, Ryan O'Donnell, and Karl Wimmer
Carnegie Mellon University
{eblais@cs,odonnell@cs,kwimmer@andrew}.cmu.edu
(Supported in part by a scholarship from the Fonds québécois de recherche sur la nature et les technologies.)

Abstract

In recent work, Kalai, Klivans, Mansour, and Servedio [KKMS05] studied a variant of the "Low-Degree (Fourier) Algorithm" for learning under the uniform probability distribution on {0,1}^n. They showed that the L1 polynomial regression algorithm yields agnostic (tolerant to arbitrary noise) learning algorithms with respect to the class of threshold functions -- under certain restricted instance distributions, including uniform on {0,1}^n and Gaussian on R^n. In this work we show how all learning results based on the Low-Degree Algorithm can be generalized to give almost identical agnostic guarantees under arbitrary product distributions on instance spaces X1 × · · · × Xn. We also extend these results to learning under mixtures of product distributions. The main technical innovation is the use of (Hoeffding) orthogonal decomposition and the extension of the "noise sensitivity method" to arbitrary product spaces. In particular, we give a very simple proof that threshold functions over arbitrary product spaces have ε-noise sensitivity O(√ε), resolving an open problem suggested by Peres [Per04].

1 Introduction

In this paper we study binary classification learning problems over arbitrary instance spaces X = X1 × · · · × Xn. In other words, each instance has n "categorical attributes", the ith attribute taking values in the set Xi. For now we assume that each Xi has cardinality at most poly(n). (Given real-valued attributes, the reader may think of bucketing them into poly(n) buckets.) It is convenient for learning algorithms to encode instances from X as vectors in {0,1}^{|X1|+···+|Xn|} via the "one-out-of-k encoding"; e.g., an attribute from X1 = {red, green, blue} is replaced by one of (1,0,0), (0,1,0), or (0,0,1). Consider now the following familiar learning algorithm. Given m training examples (x1, y1), . . . , (xm, ym) ∈ X × {-1,1}:

1. Expand each instance xi into a vector from {0,1}^{|X1|+···+|Xn|} via the "one-out-of-k" encoding.
2. Consider "features" which are products of up to d of the new 0-1 attributes.
3. Find the linear function W in the feature space that best fits the training labels under some loss measure ℓ: e.g., squared loss, hinge loss, or L1 loss.
4. Output the hypothesis sgn(W − θ), where θ ∈ [-1,1] is chosen to minimize the hypothesis' training error.

We will refer to this algorithm as "degree-d polynomial regression (with loss ℓ)". When ℓ is the hinge loss, this is equivalent to the soft margin SVM algorithm with the degree-d polynomial kernel and no regularization [CV95] (except for the minor difference of choosing an optimal θ rather than fixing θ = 0). When ℓ is the squared loss and the data is drawn i.i.d. from the uniform distribution on X = {0,1}^n, the algorithm is effectively equivalent to the Low-Degree Algorithm of Linial, Mansour, and Nisan [LMN93] -- see [KKMS05]. Using techniques from convex optimization (indeed, linear programming for L1 or hinge loss, and just basic linear algebra for squared loss), it is known that the algorithm can be performed in time poly(m, n^d). For all known proofs of good generalization for the algorithm, m = n^{Θ(d)}/ε training examples are necessary (and sufficient). Hence we will view the degree-d polynomial regression algorithm as requiring poly(n^d/ε) time and examples. (Because of this, whether or not one uses the "kernel trick" is a moot point.)

Although SVM-based algorithms are very popular in practice, the scenarios in which they provably learn successfully are relatively few (see Section 1.2 below) -- especially when there is error in the labels. Our goal in this paper is to broaden the class of scenarios in which learning with polynomial regression has provable, polynomial-time guarantees.

1.1 The learning framework

We study binary classification learning in the natural "agnostic model" [KSS94] (sometimes described as the model with arbitrary classification noise). We assume access to training data drawn i.i.d. from some distribution D on X, where the labels are provided by an arbitrary unknown "target" function t : X → {-1,1}. The task is to output a hypothesis h : X → {-1,1} which is a good predictor on future examples from D. We define the "error of h" to be err(h) = Pr_{x∼D}[h(x) ≠ t(x)]. (In this paper, boldface denotes random variables.) We compare the error of an algorithm's hypothesis with the best error achievable among functions in a fixed class C of functions X → {-1,1}. Define Opt = inf_{f∈C} err(f). We say that an algorithm A "agnostically learns with respect to C" if, given ε > 0 and access to training data, it outputs a hypothesis h which satisfies E[err(h)] ≤ Opt + ε, where the expectation is with respect to the training data drawn. (The definition of agnostic learning is sometimes taken to require error at most Opt + ε with high probability, rather than in expectation; however, this is known [KKMS05] to require almost negligible additional overhead.) The running time (and number of training examples) used are measured as functions of n and ε.

Instead of an instance distribution D on X and a target t : X → {-1,1}, one can more generally allow a distribution D on X × {-1,1}; in this case, err(h) = Pr_{(x,y)∼D}[h(x) ≠ y]. Our learning results also hold in this model just as in [KKMS05]; however, we use the simpler definition for ease of presentation, except in Section 5.3.

In the special case when t is promised to be in C we are in the scenario of PAC learning [Val84]. This corresponds to the case Opt = 0. Since C is usually chosen (by necessity) to be a relatively simple class, the PAC model's assumption that there is a perfect classifier in C is generally considered somewhat unrealistic. This is why we work in the agnostic model. Finally, since strong hardness results are known [KSS94, LBW95, KKMS05, GR06] for agnostic learning under general distributions D, we are forced to make some distributional assumptions. The main assumption in this paper is that D is a product probability distribution on X; i.e., the n attributes are independent. For a discussion of this assumption and extensions, see Section 1.3.
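As a concrete rendering of steps 1-4 above, the following is a minimal sketch in Python (with numpy) of degree-d polynomial regression using squared loss, the variant corresponding to the Low-Degree Algorithm; the L1 and hinge-loss variants would replace the least-squares fit with a linear program. The helper names are ours, and the brute-force feature expansion is meant only to illustrate the steps, not to be efficient.

    import itertools
    import numpy as np

    def one_hot(X, attribute_values):
        """One-out-of-k encode categorical instances X (a list of tuples)."""
        blocks = []
        for i, vals in enumerate(attribute_values):
            index = {v: j for j, v in enumerate(vals)}
            block = np.zeros((len(X), len(vals)))
            for r, x in enumerate(X):
                block[r, index[x[i]]] = 1.0
            blocks.append(block)
        return np.hstack(blocks)

    def degree_d_features(Z, d):
        """Products of up to d of the 0/1 columns of Z (plus the constant feature)."""
        feats = [np.ones(Z.shape[0])]
        for k in range(1, d + 1):
            for combo in itertools.combinations(range(Z.shape[1]), k):
                feats.append(np.prod(Z[:, combo], axis=1))
        return np.column_stack(feats)

    def polynomial_regression(X, y, attribute_values, d):
        """Degree-d polynomial regression with squared loss; returns a classifier."""
        y = np.asarray(y, dtype=float)
        Phi = degree_d_features(one_hot(X, attribute_values), d)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # best-fit polynomial W
        scores = Phi @ w
        # choose the threshold theta minimizing training error of sgn(W - theta)
        thetas = np.concatenate(([-1.0, 1.0], scores))
        errs = [np.mean(np.sign(scores - t + 1e-12) != y) for t in thetas]
        theta = thetas[int(np.argmin(errs))]
        def h(x):
            phi = degree_d_features(one_hot([x], attribute_values), d)
            return int(np.sign(phi @ w - theta + 1e-12)[0])
        return h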
1.2 When polynomial regression works

Although the SVM algorithm is very popular in practice, the scenarios in which it provably learns successfully are relatively few. Let us consider the SVM algorithm with degree-d polynomial kernel. The traditional SVM analysis is predicated on the assumption that the data is perfectly linearly separable in the polynomial feature space. Indeed, the heuristic arguments in support of good generalization are predicated on the data being separable with large margin. Even just the assumption of perfect separation may well be unreasonable. For example, suppose the target t is the very simple function given by the intersection of two homogeneous linear threshold functions over R^n; i.e.,

t : R^n → {-1,1},    t(x) = sgn(w1 · x) ∧ sgn(w2 · x).

It is known [MP69] that this target cannot be classified by the sign of a degree-d polynomial in the attributes for any finite d; this holds even when n = 2. Alternatively, when t is the intersection of two linear threshold functions over {0,1}^n, it is not currently known whether t can be classified by the sign of a degree-d polynomial for any d < n − 1 [OS03].

Because of this problem, one usually considers the "soft margin SVM algorithm" [CV95]. As mentioned, when this is run with no "regularization", the algorithm is essentially equivalent to degree-d polynomial regression with hinge loss. To show that this algorithm even has a chance of learning efficiently, one must be able to show that simple target functions can at least be approximately classified by the sign of low-degree polynomials. Of course, even stating any such result requires distributional assumptions. Let us make the following definition:

Definition 1.1 Let D be a probability distribution on {0,1}^N and let t : {0,1}^N → R. We say that t is ε-concentrated up to degree d (under D) if there exists a polynomial p : {0,1}^N → R of degree at most d which has squared loss at most ε under D; i.e., E_{x∼D}[(p(x) − t(x))²] ≤ ε.

It is well known that under the above conditions, h := sgn(p) has classification error at most ε under D. Further, it is relatively easy to show that if C is a class of functions each of which is ε-concentrated up to degree d, then the degree-d polynomial regression algorithm with squared loss will PAC-learn C to accuracy O(ε) under D.

The first result along these lines was due to Linial, Mansour, and Nisan [LMN93], who introduced the "Low-Degree Algorithm" for PAC-learning under the uniform distribution on {0,1}^n. They showed that if f : {0,1}^n → {-1,1} is computed by a circuit of size s and depth c then it is ε-concentrated up to degree (O(log(s/ε)))^c under the uniform distribution. Some generalizations of this result [FJS91, Has01] are discussed in Section 4.

Another result using this idea was due to Klivans, O'Donnell, and Servedio [KOS04]. They introduced the "noise sensitivity method" for showing concentration results under the uniform distribution on {0,1}^n. In particular, they showed that any t : {0,1}^n → {-1,1} expressible as a function of k linear threshold functions is ε-concentrated up to degree O(k²/ε²) under the uniform distribution.

These works obtained PAC learning guarantees for the polynomial regression algorithm -- i.e., guarantees only holding under the somewhat unrealistic assumption that Opt = 0. A significant step towards handling noise was taken in [KKMS05]. Therein it was observed that low-degree L2²-approximability bounds imply L1-approximability bounds (and hinge loss bounds), and further, such bounds imply that the polynomial regression algorithm works in the agnostic learning model. Specifically, their work contains the following theorem:

Theorem 1.2 ([KKMS05]) Let D be a distribution on {0,1}^N and let C be a class of functions {0,1}^N → {-1,1} each of which is ε²-concentrated up to degree d under D. Then the degree-d polynomial regression algorithm with L1 loss (or hinge loss [Kal06]) uses poly(N^d/ε) time and examples, and agnostically learns with respect to C under D.
Thus one gets agnostic learning algorithms under the uniform distribution on {0,1}^n with respect to the class of AC0 circuits (time n^{polylog(n/ε)}) and the class of functions of k thresholds (time n^{O(k²/ε⁴)}) -- note that the latter is polynomial time assuming k and ε are constants. Kalai et al. also obtained related results for agnostically learning with respect to single threshold functions under Gaussian and log-concave distributions on R^n.

1.3 Overview of our learning results

We view the work of [KKMS05] as the first provable guarantee that one can learn interesting, broad classes of functions under the realistic noise model of agnostic learning (and in particular, that SVM-type methods can have this guarantee). One shortcoming of the present state of knowledge is that we have good concentration bounds for classes essentially only with respect to the uniform distribution on {0,1}^n and the Gaussian distribution on R^n. ([FJS91] gives bounds for AC0 under constant-bounded product distributions on {0,1}^n; [KKMS05] gives inexplicit bounds for a single threshold function under log-concave distributions on R^n.) In this work we significantly broaden the class of distributions for which we can prove good concentration bounds, and hence for which we can prove the polynomial regression algorithm performs well. Roughly speaking, we show how to generalize any concentration result for the uniform distribution on {0,1}^n into the same concentration result for arbitrary product distributions D on instance spaces X = X1 × · · · × Xn.

We believe this is a significant generalization for several reasons. First, even just for the instance space {0,1}^n, the class of arbitrary product distributions is much more reasonable than the single distribution in which each attribute is 0 or 1 with probability exactly 1/2. Our results are even stronger than this, though: they give an algorithm that works simultaneously for any product distribution over any instance space X = X1 × · · · × Xn where each |Xi| ≤ poly(n). Because we can handle non-binary attributes, the restriction to product spaces becomes much less severe. A common criticism of learning results under the uniform distribution or product distributions on {0,1}^n is that they make the potentially unreasonable assumption that attributes are independent. However, with our results one can somewhat circumvent this. Suppose one believes that the attributes X1, . . . , Xn are mostly independent, but some groups of them (e.g., height and weight) have mutual dependencies. One can then simply group together any dependent attribute sets Xi1, . . . , Xit into a single "super-attribute" set (Xi1 × · · · × Xit). Assuming that this eliminates dependencies -- i.e., the new (super-)attributes are all independent -- and that each |Xi1 × · · · × Xit| is still at most poly(n), one can proceed to use the polynomial regression algorithm. Here we see the usefulness of being able to handle arbitrary product distributions on arbitrary product sets.

In many reasonable cases our results can also tolerate the attribute sets Xi having superpolynomial size. What is really necessary is that the probability distribution on each Xi is mostly concentrated on polynomially many attribute values. Indeed, we can further handle the common case when attributes are real-valued. As long as the probability distributions on real-valued attributes are not extremely skewed (e.g., Gaussian, exponential, Laplace, Pareto, chi-square, . . .
) our learning results go through after doing a naive "bucketing" scheme.

Finally, being able to learn under arbitrary product distributions opens the door to learning under mixtures of product distributions. Such mixtures -- especially mixtures of Gaussians -- are widely used as data distribution models in learning theory. We show that agnostic learning under mixtures can be reduced to agnostic learning under single product distributions. If the mixture distribution is precisely known to the algorithm, it can learn even under a mixture of polynomially many product distributions. Otherwise, when the mixture is unknown, we first need to use an algorithm for learning (or clustering) a mixture of product distributions from unlabeled examples. This is a difficult but well-studied problem. Using results of Feldman, O'Donnell, and Servedio [FOS05, FOS06] we can extend all of our agnostic learning results to learning under mixtures of constantly many product distributions with each |Xi| ≤ O(1), and under mixtures of constantly many (axis-aligned) Gaussian distributions.

1.4 Outline of technical results

In Section 2 we recall the orthogonal decomposition of functions on product spaces, as well as the more recently studied notions of concentration and noise sensitivity on such spaces. In particular, we observe that if one can prove a good noise sensitivity bound for a class C under a product distribution μ, then [KKMS05] implies that the polynomial regression algorithm yields a good agnostic learner with respect to C under μ.

Section 3 contains the key reduction from noise sensitivity in general product spaces to noise sensitivity under the uniform distribution on {0,1}^n. It is carried out in the model case of linear threshold functions, which Peres [Per04] proved have ε-noise sensitivity at most O(√ε). We give a surprisingly simple proof of the following:

Theorem 3.2 Let f : X → {-1,1} be a linear threshold function, where X = X1 × · · · × Xn has the product distribution μ = μ1 × · · · × μn. Then NS_ε(f) ≤ O(√ε).

Proving this just in the case of a p-biased distribution on {0,1}^n was an open problem suggested in [Per04]. This noise sensitivity bound thus gives us the following learning result:

Theorem 3.4 Let μ = μ1 × · · · × μn be any product distribution over an instance space X = X1 × · · · × Xn, where we assume |Xi| ≤ poly(n) for each i. Let C denote the class of functions of k linear threshold functions over X. Taking d = O(k²/ε⁴), the degree-d polynomial regression algorithm with L1 loss (or hinge loss) uses n^{O(k²/ε⁴)} time and examples and agnostically learns with respect to C.

In Section 4 we discuss how to extend concentration results for other concept classes from uniform on {0,1}^n to arbitrary product distributions on product spaces X = X1 × · · · × Xn. Of course, it's not immediately clear, given a concept class C of functions on {0,1}^n, what it even means for it to be generalized to functions on X. We discuss a reasonable such notion based on the one-out-of-k encoding, and illustrate it in the case of AC0 functions. The idea in this section is simple: any concentration result under uniform on {0,1}^n easily implies a (slightly weaker) noise sensitivity bound; this can be translated into the same noise sensitivity bound under any product distribution using the methods of Section 3. In turn, that implies a concentration bound in the general product space.
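In symbols, the chain just described reads as follows (our summary, using the notation and the propositions of Sections 2 and 4; it is a restatement of those results, not a new claim):

$$
\sum_{|S|>d}\hat{f}(S)^2 \le \epsilon
\;\;\stackrel{\text{Prop. 4.1}}{\Longrightarrow}\;\;
\mathrm{NS}_{\epsilon/d}(f)\le\epsilon \ \text{(uniform)}
\;\;\stackrel{\text{Sec. 3}}{\Longrightarrow}\;\;
\mathrm{NS}_{\epsilon/d}(f)\le\epsilon \ \text{(any product $\mu$)}
\;\;\stackrel{\text{Prop. 2.16}}{\Longrightarrow}\;\;
\sum_{i>d/\epsilon}\|f^{=i}\|_2^2\le\frac{2\epsilon}{1-1/e}.
$$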
As an example, we prove the following:

Theorem 4.2 Let C be the class of functions X1 × · · · × Xn → {-1,1} computed by unbounded fan-in circuits of size at most s and depth at most c (under the one-out-of-k encoding). Assume |Xi| ≤ poly(n) for each i. Let μ be any product distribution on X1 × · · · × Xn. Then polynomial regression agnostically learns with respect to C under arbitrary product distributions in time n^{(O(log(s/ε)))^{c−1}/ε²}.

Section 5 describes extensions of our learning algorithm to cases beyond those in which one has exactly a product distribution on an instance space X = X1 × · · · × Xn with each |Xi| ≤ poly(n): these extensions include distributions "bounded by" or "close to" product distributions, as well as certain cases when the Xi's have superpolynomial cardinality or are R. We end Section 5 with a discussion of learning under mixtures of product distributions. Here there is a distinction between learning when the mixture distribution is known to the algorithm and when it is unknown. In the former case we prove, e.g.:

Theorem 5.16 Let D be any known mixture of poly(n) product distributions over an instance space X = X1 × · · · × Xn, where we assume |Xi| ≤ poly(n) for each i. Then there is an n^{O(k²/ε⁴)}-time algorithm for agnostically learning with respect to the class of functions of k linear threshold functions over X under D.

In the latter case, by relying on algorithms for learning mixture distributions from unlabeled data, we prove:

Theorem 5.18 Let D be any unknown mixture of O(1) product distributions over an instance space X = X1 × · · · × Xn, where we assume either: a) |Xi| ≤ O(1) for each i; or b) each Xi = R and each product distribution is a mixture of axis-aligned (poly(n)-bounded) Gaussians. Then there is an n^{O(k²/ε⁴)}-time algorithm for agnostically learning with respect to the class of functions of k linear threshold functions over X under D.

2 Product probability spaces

In this section we consider functions f : X → R, where X = X1 × · · · × Xn is a product set. We will also assume X is endowed with some product probability distribution μ = μ1 × · · · × μn. All occurrences of Pr[·] and E[·] are with respect to this distribution unless otherwise noted, and we usually write x for a random element of X drawn from μ. For simplicity we assume that each set Xi is finite. (In fact, we will only need that each L2(Xi, μi) has a countable basis.) The vector space L2(X, μ) of all functions f : X → R is viewed as an inner product space under the inner product ⟨f, g⟩ = E[f(x)g(x)]. We will also use the notation ‖f‖₂ = √⟨f, f⟩ = √(E[f(x)²]).

2.1 Orthogonal decomposition

As each Xi is just an abstract set, there is not an inherent notion of a degree-d polynomial on X. Ultimately the polynomial regression algorithm identifies X with a subset of {0,1}^{|X1|+···+|Xn|} via the "one-out-of-k encoding" and works with polynomials over this space. However, to prove concentration results we need to take a more abstract approach and consider the "(Hoeffding) orthogonal decomposition" of functions on product spaces; see [vM47, Hoe48, KR82, Ste86]. In this section we recall this notion with our own notation.

Definition 2.1 We say a function f : X1 × · · · × Xn → R is a simple function of order d if it depends on at most d coordinates.

Definition 2.2 We say a function f : X1 × · · · × Xn → R is a function of order d if it is a linear combination of simple functions of order d. The set of all such functions is a linear subspace of L2(X, μ) and we denote it by H^{≤d}(X, μ).
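To make the decomposition concrete, here is a small brute-force sketch (Python; the function names and example are ours) that computes the projections f^{⊆S} and f^S of Proposition 2.7 below, using the equivalent recursion f^S = f^{⊆S} − Σ_{R⊊S} f^R, and from them the order-d part f^{≤d}. It is meant only to illustrate the definitions on toy product spaces, not to be efficient.

    from itertools import product, combinations

    def decompose(f, spaces):
        """Hoeffding orthogonal decomposition of f under the product distribution
        given by `spaces` (a list of dicts mapping attribute value -> probability).
        Returns a dict S -> {x: f^S(x)} over all subsets S of coordinates."""
        n = len(spaces)
        X = list(product(*[list(s) for s in spaces]))

        def cond_exp(S, x):
            # f^{subset S}(x) = E[f(y) | y_i = x_i for all i in S]
            total = 0.0
            for y in X:
                if all(y[i] == x[i] for i in S):
                    w = 1.0
                    for i in range(n):
                        if i not in S:
                            w *= spaces[i][y[i]]
                    total += w * f(y)
            return total

        parts = {}
        for k in range(n + 1):
            for S in combinations(range(n), k):
                parts[S] = {x: cond_exp(S, x)
                               - sum(parts[R][x] for R in parts if set(R) < set(S))
                            for x in X}
        return parts

    def order_le_d(parts, d):
        """The order-d part f^{<=d} = sum of f^S over |S| <= d (cf. Prop. 2.9)."""
        X = parts[next(iter(parts))].keys()
        return {x: sum(fS[x] for S, fS in parts.items() if len(S) <= d) for x in X}

    # Example: two attributes, X1 = {'a','b'} with a biased mu_1, X2 uniform on {0,1,2}.
    spaces = [{'a': 0.75, 'b': 0.25}, {0: 1/3, 1: 1/3, 2: 1/3}]
    parts = decompose(lambda x: 1.0 if x[0] == 'a' and x[1] == 2 else -1.0, spaces)
    f_le_1 = order_le_d(parts, 1)   # best order-1 approximation in squared loss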
Definition 2.3 We say a function f : X1 × · · · × Xn → R is a function of order exactly d if it is a function of order d and it is orthogonal to all functions of order d − 1; i.e., ⟨f, g⟩ = 0 for all g ∈ H^{≤d−1}(X, μ). This is again a linear subspace of L2(X, μ) and we denote it by H^{=d}(X, μ).

Proposition 2.4 The space L2(X, μ) is the orthogonal direct sum of the H^{=d}(X, μ) spaces, L2(X, μ) = ⊕_{d=0}^{n} H^{=d}(X, μ).

Definition 2.5 By virtue of the previous proposition, every function f : X1 × · · · × Xn → R can be uniquely expressed as f = f^{=0} + f^{=1} + f^{=2} + · · · + f^{=n}, where f^{=d} : X1 × · · · × Xn → R denotes the projection of f into H^{=d}(X, μ). We call f^{=d} the order-d part of f. We will also write f^{≤d} = f^{=0} + f^{=1} + f^{=2} + · · · + f^{=d}.

In the sequel we will write simply H^{=d} in place of H^{=d}(X, μ), etc. Although we will not need it, we recall a further refinement of this decomposition:

Definition 2.6 For each S ⊆ [n] we define H^{⊆S} to be the subspace consisting of all functions depending only on the coordinates in S. We define H^{S} to be the further subspace consisting of those functions in H^{⊆S} that are orthogonal to all functions in H^{⊆R} for each R ⊊ S.

Proposition 2.7 The space L2(X, μ) is the orthogonal direct sum of the H^{S} spaces, L2(X, μ) = ⊕_{S⊆[n]} H^{S}. Hence every function f : X1 × · · · × Xn → R can be uniquely expressed as f = Σ_{S⊆[n]} f^{S}, where f^{S} : X1 × · · · × Xn → R denotes the projection of f into H^{S}. Denoting also f^{⊆S} = Σ_{R⊆S} f^{R} for the projection of f into H^{⊆S}, we have the following interpretations:

f^{⊆S}(y1, . . . , yn) = E[f(x1, . . . , xn) | xi = yi for all i ∈ S];     f^{S} = Σ_{R⊆S} (−1)^{|S|−|R|} f^{⊆R}.

Finally, we connect the orthogonal decomposition of functions f : X → R with their analogue under the one-out-of-k encoding:

Proposition 2.8 A function f : X → R is of order d if and only if its analogue f : {0,1}^{|X1|+···+|Xn|} → R under the one-out-of-k encoding is expressible as a polynomial of degree at most d.

2.2 Low-order concentration

As in the previous section we consider functions f : X → R under a product distribution μ. We will be especially interested in classifiers, functions f : X → {-1,1}. Our goal is to understand and develop conditions under which such f can be approximated in squared loss by low-degree polynomials. By basic linear algebra, we have the following:

Proposition 2.9 Given f : X → R, the best order-d approximator to f under squared loss is f^{≤d}. I.e.,

min_{g of order d} E[(f(x) − g(x))²] = ‖f − f^{≤d}‖₂² = Σ_{i=d+1}^{n} ‖f^{=i}‖₂².

Definition 2.10 Given f : X → R we say that f is ε-concentrated up to order d if Σ_{i=d+1}^{n} ‖f^{=i}‖₂² ≤ ε.

By Proposition 2.8 we conclude the following:

Proposition 2.11 Let f : X → R and identify f with a function {0,1}^N → R under the one-out-of-k encoding. Then there exists a polynomial p : {0,1}^N → R of degree at most d which ε-approximates f in squared loss under μ if and only if f is ε-concentrated up to order d.

Combining this with the KKMS Theorem 1.2, we get the following learning result about polynomial regression:

Theorem 2.12 Let μ = μ1 × · · · × μn be a product distribution on X = X1 × · · · × Xn. Write N for the total number of possible attribute values, N = |X1| + · · · + |Xn|. Let C be a class of functions X → {-1,1} each of which is ε²-concentrated up to order d under μ. Then the degree-d polynomial regression algorithm with L1 loss (or hinge loss) uses poly(N^d/ε) time and examples, and agnostically learns with respect to C under μ.

We will now show how to prove low-order concentration results by extending the "noise sensitivity method" of [KOS04] to general product spaces.

2.3 Noise sensitivity

We recall the generalization of noise sensitivity [BKS99] to general product spaces, described in [MOO05].

Definition 2.13 Given x ∈ X1 × · · · × Xn and 0 ≤ ρ ≤ 1, we define a ρ-noisy copy of x to be a random variable y with distribution N_ρ(x), where this denotes that each y_i is chosen to equal x_i with probability ρ and to be randomly drawn from μ_i with probability 1 − ρ, independently across i.

Definition 2.14 The noise operator T_ρ on functions f : X → R is given by (T_ρ f)(x) = E_{y∼N_ρ(x)}[f(y)]. The noise stability of f at ρ is S_ρ(f) = ⟨f, T_ρ f⟩. When f : X → {-1,1} we also define the noise sensitivity of f at ε ∈ [0,1] to be NS_ε(f) = 1/2 − (1/2) S_{1−ε}(f) = Pr_{x, y∼N_{1−ε}(x)}[f(x) ≠ f(y)].

The connection between noise stability, sensitivity, and concentration comes from the following two facts:

Proposition 2.15 ([MOO05]) For any f : X → R, S_ρ(f) = Σ_{i=0}^{n} ρ^i ‖f^{=i}‖₂².

Proposition 2.16 ([KOS04]) Suppose NS_ε(f) ≤ α. Then f is (2α/(1 − 1/e))-concentrated up to order 1/ε.

For example, Peres proved the following theorem:

Theorem 2.17 ([Per04]) If f : {0,1}^n → {-1,1} is a linear threshold function then NS_ε(f) ≤ O(1)·√ε (under the uniform distribution on {0,1}^n). From [O'D03] we have that the O(1) can be taken to be 5/4 for every value of n and ε.

It clearly follows that if f is any function of k linear threshold functions then NS_ε(f) ≤ (5/4)k√ε. Combining this with Proposition 2.16:

Theorem 2.18 ([KOS04]) Let f : {0,1}^n → {-1,1} be any function of k linear threshold functions. Then f is (4k/√d)-concentrated up to order d under the uniform distribution, for any d ≥ 1. In particular, f is ε²-concentrated up to order O(k²/ε⁴).
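To make Definitions 2.13 and 2.14 concrete, the following is a small Monte Carlo sketch (Python; the function names and the example at the end are ours) for estimating NS_ε(f) under a product distribution specified by per-coordinate samplers. By Theorem 3.2 in the next section, the estimate for the example linear threshold function should be on the order of √ε.

    import random

    def noisy_copy(x, eps, samplers):
        """An eps-noisy copy y ~ N_{1-eps}(x): each coordinate is kept with
        probability 1-eps and re-drawn from mu_i with probability eps."""
        return tuple(samplers[i]() if random.random() < eps else x[i]
                     for i in range(len(x)))

    def estimate_noise_sensitivity(f, samplers, eps, trials=100_000):
        """Monte Carlo estimate of NS_eps(f) = Pr[f(x) != f(y)], where x ~ mu and
        y ~ N_{1-eps}(x), for the product distribution mu given by `samplers`."""
        disagreements = 0
        for _ in range(trials):
            x = tuple(s() for s in samplers)
            y = noisy_copy(x, eps, samplers)
            disagreements += (f(x) != f(y))
        return disagreements / trials

    # Example: a linear threshold function over a biased product distribution.
    n, eps = 50, 0.01
    samplers = [lambda: 1 if random.random() < 0.9 else 0] * n
    f = lambda x: 1 if sum(x) >= 0.9 * n else -1
    print(estimate_noise_sensitivity(f, samplers, eps, trials=20_000))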
3 Noise sensitivity of threshold functions in product spaces

In this section we show that Peres's theorem can be extended to hold for linear threshold functions in all product spaces.

Definition 3.1 We say a function f : X1 × · · · × Xn → {-1,1} is a linear threshold function if its analogue f : {0,1}^N → {-1,1} under one-out-of-k encoding is expressible as a linear threshold function. Equivalently, f is a linear threshold function if there exist weight functions wi : Xi → R, i = 1 . . . n, and a number θ ∈ R such that

f(x1, . . . , xn) = sgn( Σ_{i=1}^{n} wi(xi) − θ ).

No version of Peres's Theorem 2.17 was previously known to hold even in the simple case of linear threshold functions on {0,1}^n under a p-biased product distribution with p ≠ 1/2. Understanding just this nonsymmetric case was left as an open question in [Per04]. We now show that threshold functions over general product spaces are no more noise sensitive than threshold functions over {0,1}^n under the uniform distribution.

Theorem 3.2 Let f : X → {-1,1} be a linear threshold function, where X = X1 × · · · × Xn has the product distribution μ = μ1 × · · · × μn. Then NS_ε(f) ≤ (5/4)√ε.

Proof: For a pair of instances z0, z1 ∈ X and a vector x ∈ {0,1}^n, we introduce the notation z_x for the instance whose ith attribute (z_x)_i is the ith attribute of z_{x_i}. For any fixed z0, z1 ∈ X we can define g_{z0,z1} : {0,1}^n → {-1,1} such that g_{z0,z1}(x) = f(z_x). Note that this function is a linear threshold function in the traditional binary sense.

Let z0, z1 now denote independent random draws from μ, and let x denote a uniformly random vector from {0,1}^n. We have that z_x is distributed as a random draw from μ. Further pick y ∈ {0,1}^n to be an ε-noisy copy of x, i.e. y ∼ N_{1−ε}(x). Then z_y is distributed as N_{1−ε}(z_x). We now have

NS_ε(f) = Pr_{z0,z1,x,y}[f(z_x) ≠ f(z_y)] = E_{z0,z1}[ Pr_{x,y}[f(z_x) ≠ f(z_y)] ] = E_{z0,z1}[ Pr_{x,y}[g_{z0,z1}(x) ≠ g_{z0,z1}(y)] ].

Once z0 and z1 are fixed, the quantity inside the expectation is just the noise sensitivity at ε of the binary linear threshold function g_{z0,z1}, which we can bound by (5/4)√ε using Theorem 2.17. So

NS_ε(f) = E_{z0,z1}[ Pr_{x,y}[g_{z0,z1}(x) ≠ g_{z0,z1}(y)] ] ≤ E_{z0,z1}[(5/4)√ε] = (5/4)√ε,

which is what we wanted to show. □

As with Theorem 2.18, we conclude:

Theorem 3.3 Let f : X → {-1,1} be any function of k linear threshold functions, where X = X1 × · · · × Xn has the product distribution μ = μ1 × · · · × μn. Then f is (4k/√d)-concentrated up to order d, for any d ≥ 1. In particular, f is ε²-concentrated up to order O(k²/ε⁴).

By combining Theorem 3.3 with our main learning theorem, Theorem 2.12, we conclude:

Theorem 3.4 Let μ = μ1 × · · · × μn be any product distribution over an instance space X = X1 × · · · × Xn, where we assume |Xi| ≤ poly(n) for each i. Let C denote the class of functions of k linear threshold functions over X. Taking d = O(k²/ε⁴), the degree-d polynomial regression algorithm with L1 loss (or hinge loss) uses n^{O(k²/ε⁴)} time and examples and agnostically learns with respect to C.
4 Concentration for other classes under product distributions

In this section we illustrate how essentially any result about ε-concentration of classes of functions under the uniform distribution on {0,1}^n can be translated into a similar result for general product distributions. Besides linear threshold functions, the other main example of concentration comes from the original application of the Low-Degree Algorithm [LMN93]: learning AC0 functions in quasipolynomial time. Recall that AC0 is the class of functions computed by unbounded fan-in circuits of constant depth and polynomial size. We will use this as a running example.

Suppose C is a class of functions X → {-1,1}, where X = X1 × · · · × Xn. As usual, under the one-out-of-k encoding we can think of C as a class of functions {0,1}^N → {-1,1}. In our example, this gives a reasonable notion of "AC0 circuits on general product sets X". Suppose further that C' ⊇ C is any class of functions {0,1}^N → {-1,1} which is closed under negation of inputs and closed under fixing inputs to 0 or 1. In our example, the class of AC0 circuits indeed has this basic property (as does the more precisely specified class of all circuits with size at most s and depth at most c). Now by repeating the proof of Theorem 3.2, it is easy to see that any upper bound one can prove on the noise sensitivity of functions in C' under the uniform distribution on {0,1}^N immediately translates into an identical bound on the noise sensitivity of functions in C on X under any product distribution. The only thing to notice is that the functions g_{z0,z1} arising in that proof will be in the class C'.

Thus we are reduced to proving noise sensitivity bounds for functions on {0,1}^n under the uniform distribution. Furthermore, any result on ε-concentration of functions on {0,1}^n under the uniform distribution can be easily translated into a noise sensitivity bound which is not much worse:

Proposition 4.1 Suppose that f : {0,1}^n → {-1,1} is ε-concentrated up to degree d under the uniform distribution on {0,1}^n. Then NS_{ε/d}(f) ≤ ε.

Proof: Using traditional Fourier notation instead of orthogonal decomposition notation, we have

S_{1−ε/d}(f) = Σ_{S⊆[n]} (1 − ε/d)^{|S|} \hat{f}(S)² ≥ (1 − ε/d)^d (1 − ε) ≥ (1 − ε)(1 − ε) = (1 − ε)²,

where the first inequality used the fact that f is ε-concentrated up to degree d. Thus NS_{ε/d}(f) = 1/2 − (1/2) S_{1−ε/d}(f) ≤ 1/2 − (1/2)(1 − ε)² ≤ ε. □

Finally, applying Proposition 2.16, we get O(ε)-concentration up to order d/ε for the original class C of functions X → {-1,1}, under any product distribution on X. This leads to an agnostic learning result for C under arbitrary product distributions which is the same as the one we would get for C under the uniform distribution on {0,1}^n, except for an extra factor of 1/ε² in the running time's exponent.

For example, with regard to AC0 functions, [LMN93, Has01] proved the following:

Theorem 4.2 Let f : {0,1}^n → {-1,1} be computable by an unbounded fan-in circuit of size at most s and depth at most c. Then f is ε-concentrated up to degree d = (O(log(s/ε)))^{c−1}.

We therefore may conclude:

Theorem 4.3 Let C be the class of functions X1 × · · · × Xn → {-1,1} computed by unbounded fan-in circuits of size at most s and depth at most c (under the one-out-of-k encoding). Assume |Xi| ≤ poly(n) for each i. Let μ be any product distribution on X1 × · · · × Xn. Then every f ∈ C is (2ε²/(1 − 1/e))-concentrated up to order d = (O(log(s/ε)))^{c−1}/ε². As a consequence, polynomial regression agnostically learns with respect to C under arbitrary product distributions in time n^{(O(log(s/ε)))^{c−1}/ε²}.

This result should be compared to the following theorem from Furst, Jackson, and Smith [FJS91] for PAC-learning under bounded product distributions on {0,1}^n:

Theorem 4.4 ([FJS91]) The class C of functions {0,1}^n → {-1,1} computed by unbounded fan-in circuits of size at most s and depth at most c can be PAC-learned under any product distribution in time n^{(O((1/p) log(s/ε)))^{c+O(1)}}, assuming the mean of each coordinate is in the range [p, 1 − p].

The advantage of the result from [FJS91] is that it does not pay the extra 1/ε² in the exponent. The advantage of our result is that it holds under arbitrary product distributions on product sets. (Our result is in the agnostic model, but the result of [FJS91] could also be, by applying the results of [KKMS05].)

5 Extensions

5.1 Distributions close to or dominated by product distributions

We begin with some simple observations showing that the underlying distribution need not be precisely a product distribution. First, the following fact can be considered standard:

Proposition 5.1 Suppose that under distribution D, algorithm A agnostically learns with respect to class C, using m examples to achieve error ε. If D' is any distribution satisfying ‖D − D'‖₁ ≤ ε/m, then A also agnostically learns under D', using m examples to achieve error 2ε + 2ε/m ≤ 4ε.

Proof: The key fact we use is that if X is a random variable with |X| ≤ 1 always, then |E_D[X] − E_{D'}[X]| ≤ ‖D − D'‖₁. This implies that for any hypothesis h, |err_D(h) − err_{D'}(h)| ≤ ε/m. In particular, it follows that Opt_D ≤ Opt_{D'} + ε/m. Further, let h denote the hypothesis A produces, viewed as a random variable determined by the m training examples. By assumption, we have E_{D^m}[err_D(h)] ≤ Opt_D + ε, which is at most Opt_{D'} + ε + ε/m. Since ‖D^m − D'^m‖₁ ≤ m·(ε/m) = ε, the key fact applied to err_D(h) implies E_{D'^m}[err_D(h)] ≤ Opt_{D'} + ε + ε/m + ε. Finally, as we saw, err_{D'}(h) ≤ err_D(h) + ε/m always. Thus E_{D'^m}[err_{D'}(h)] ≤ Opt_{D'} + 2ε + 2ε/m, completing the proof. □

We will use the above result later when learning under mixtures of product distributions.
A simple extension to the case when the distribution is "dominated" by a product distribution was already pointed out in [KKMS05]: Observation 5.2 Let D be a distribution on X which is "C dominated" by a product probability distribution = 1 × · · · × n ; i.e., for all x X , D(x) C (x). If f is concentrated up to degree d under , then f is C -concentrated up to degree d under D. He n c e : Theorem 5.3 Suppose we are in the setting of Theorem 3.4 except that is any distribution which is C -dominated by a product probability distribution. Then the degree-d polynomial regression algorithm learns with respect to C with 22 4 d = O(C 2 k 2 /4 ) and hence nO(C k / ) time and examples. 5.2 Larger attribute domains So far we have assumed that each attribute space Xi is only of polynomial cardinality. This can fairly easily be relaxed to the assumption that most of the probability mass in each (Xi , i ) is concentrated on polynomially many atoms. Let us begin with some basic preliminaries: Notation 5.4 Given a distribution on a set X , as well as a subset X X , we use the notation for the distribution on X given by conditioning on this set. (We always assume (X ) = 0.) Fact 5.5 Let X = X1 × · · · × Xn and let = 1 × · · · × n be a product distribution on X . Let Xi Xi , i = 1 . . . n, and write for the distribution conditioned on the set X = X1 × · · · × Xn . Then is the product distribution 1 × · · · × n . We now observe that if X is a "large" subset of X , then any two functions which are close in L2 (X , ) are also close in L2 (X , ): Proposition 5.6 In the setting of Fact 5.5, suppose that Prxi i [xi Xi ] 1/(2n) for all i. Then for any two functions f : X R and g : X R, f |X - g |X 2, 2 · f - g 2, 2 2 where f |X : X R denotes the restriction of f to X , and similarly for g |X . Proof: Writing h = f - g , we have h 2 2, Our algorithm will learn whenever all n attribute sets are, say, (/n, poly(n))-bounded. The first step of the algorithm will be to determine a set of attribute values which contain almost all of the probability mass. Lemma 5.9 Let (X, ) be an ( , r)-bounded probability space. Let Z be a set of m = r ln(r/ )/ samples drawn independently from . Define Y to be the set {x X : x was sampled in Z }. Then with probability at least 1 - , the set Y satisfies Prx [x Y ] 2 . / Proof: In fact, we will prove the slightly stronger statement that with probability at least 1 - the set Y satisfies Prx [x Y X ] 2 , where X is any set fulfilling the / ( , r)-boundedness condition of (X, ). To prove the claim, we split the sampling procedure into r epochs, where we draw ln(r/ )/ samples in each epoch. Let Yi be the set of all atoms in X sampled among the first i epochs, with Y0 denoting the empty set. We will prove that with probability at least 1 - , the following holds for all epochs i [r]: either Yi-1 satisfies Prx [x Yi-1 / X ] 2 , or (Yi X ) \ Yi-1 = (i.e., we see a "new" atom from X in the ith epoch). Let's first note that satisfying the above conditions implies that in the end Prx [x Y X ] 2 . This is / straightforward: if at any epoch Yi-1 satisfies Prx [x / Yi-1 X ] 2 then we're done because Y Yi-1 . Otherwise, in all r epochs we see a new atom from X , and hence at the end of the sampling we will have seen r distinct atoms of X ; then |X | r implies that our final Y X . Now to complete the proof let us bound the probability that for a given i [r] the Yi-1 does not satisfy Prx [x / Yi-1 X ] 2 and we do not see a new element of X in the ith epoch. 
Note that if Prx [x Yi-1 X ] > 2 , then / the fact that Prx [x X ] implies that Prx [x / X \ Yi-1 ] > . So the probability that we do not observe any element of X \ Yi-1 in ln(r/ )/ samples is x = = x x E [h(x)2 ] Pr [x X ] · E [h(x)2 | x X ] x x x + Pr [x X ] · E [h(x)2 | x X ]. / / Using Ex [h(x) | x X ] 0, we have / h 2 2, = x x Pr [x X ] · E [h(x)2 | x X ] x x Pr [x X ] · E [h(x)2 ]. But by the union bound Pr [x X ] / in Pr [xi Xi ] n · 1/(2n) = 1/2, / x =1 x i i so Prx [x X ] 1/2. Thus 2· h 2 2, x Pr [x X \ Yi-1 ]ln(r/)/ / (1 - )ln(r/)/ E [h(x)2 ] = f |X - g |X 2, , 2 e-·ln(r/)/ = /r. completing the proof. 2 Corollary 5.7 In the setting of the previous proposition, if f is -concentrated up to order d under , then f |X is 2concentrated up to order d under . Proof: It suffices to observe that if g : X R is a function of order d, then g |X is also a function of order d. 2 We can now describe an extended learning algorithm which works when the attribute spaces are mostly supported on sets of polynomial cardinality: Definition 5.8 We say that a finite probability space (X, ) is ( , r)-bounded if there exists a subset X X of cardinality at most |X | r such that Prx [x X ] . / By applying the union bound, we see that there is probability at most that any of the r epochs fails, so we're done. 2 We now give our extended learning algorithm: 1. Draw a set Z 1 of m1 unlabeled examples. 2. Draw a set Z 2 of m2 labeled examples. 3. Delete from Z 2 any instance/label pair where the instance contains an attribute value not appearing in Z 1 . 4. Run the degree-d polynomial regression algorithm on Z 2 . Theorem 5.10 Let = 1 × · · · × n be a product distribution on the set X = X1 × · · · × Xn and assume each probability space (Xi , i ) is (/n, r)-bounded. Write N = nr. Let C be a class of functions X {-1, 1} each of which is 2 -concentrated up to order d. Set m1 = poly(N/) and m2 = poly(N d /). The above algorithm uses poly(N d /) time and examples and agnostically learns with respect to C under . Proof: For simplicity we will equivalently prove that the algorithm outputs a hypothesis with error at most Opt + O(), rather than Opt + . We first want to establish that with probability at least 1 - , the set of attributes observed in the sample Z 1 covers almost all of the probability mass of . For each i [n], let Xi be the set of attribute values from Xi observed in at least one of the samples in Z 1 . Using the fact that each (Xi , i ) is (/n, r)-bounded, Lemma 5.9 implies that for sufficiently large m1 = poly(N/) log(1/), each Xi will / satisfy Prxi i [xi Xi ] 2/n except with probability at most /n. Applying the union bound, all Xi simultaneously satisfy the condition with probability at least 1-. We hence forth assume this happens. Writing X = X1 × · · · × Xn , we note that, by the union bound, Prx [x X ] 2. The second thing we establish is that we do not throw away too many examples in Step 3 of the algorithm. We have just observed that the probability a given example in Z 2 is deleted is at most 2. We may assume 2 1/2, and then a Chernoff bound (and m2 log(1/)) easily implies that with probability at least 1 - , at most, say, two-thirds of all examples are deleted. Assuming this happens, we have that even after deletion, Z 2 still contains at least poly(N d /) many examples. We now come to the main part of the proof, which is based on the observation that the undeleted examples in Z 2 are distributed as i.i.d. draws from the restricted product distribution gotten by conditioning on X . 
Thus we are in a position to apply our main learning result, Theorem 2.12. The polynomial regression part of the above algorithm indeed uses poly(N d /) time and examples, and it remains to analyze the error of the hypothesis it outputs. First, we use the fact that each function f in C is 2 concentrated up to order d to conclude that each function f |X in "C |X " is 22 -concentrated up to order d. This uses Proposition 5.6 and the fact that we may assume 2 1/2. Next, the guarantee of Theorem 2.12 is that when learning the target classifier t (viewed as a function X {-1, 1} or X {-1, 1}), the expected error under of the hypothesis h produced is at most Opt + O(), where Opt = min f C |X x and we conclude that the expected error under of h on t is at most Opt + 2 + O() = Opt + O(). Finally, the same observation implies that the expected error under of h on t is at most Opt + 2 + O() = Opt + O(). We have thus established that with probability at least 1 - 2, the polynomial regression part of the above algorithm outputs a hypothesis with expected error at most Opt + O(). It follows that the overall expected error is at most Opt + O(), as desired. 2 5.3 Real-valued attributes We next consider the particular case of learning with respect to linear threshold functions, but when some of the attributes are real-valued. This case is relatively easily handled by discretizing the ranges of the distributions and using the previously discussed techniques. Our approach works for a very wide variety of distributions on R; these distributions need not even be continuous. We only need the distributions to satisfy "polynomial boundedness and anti-concentration" bounds. Definition 5.11 We say that a distribution D over R is B polynomially bounded if for all > 0, there is an interval I of length at most poly(B / ) such that PrxD [x I ] . Definition 5.12 Given a real-valued random variable x with distribution D, recall that the Levy (anti-)concentration func´ tion Q(x; ) is defined by Q(x; ) = sup Pr [x [t - /2, t + /2]] . tR xD We say that D has B -polynomial anti-concentration if Q(D; ) poly(B ) · c for some positive c > 0. Note that if D is a continuous distribution with pdf everywhere at most B then it has B -polynomial anti-concentration (with c = 1 in fact). Having polynomial boundedness and concentration is an extremely mild condition; for example, the following familiar continuous distributions are all B -polynomial bounded and have B -polynomial anti-concentration: Gaussians with 1/B 2 B ; exponential distributions with 1/B B ; Laplace distributions with scale parameter with 1/B b B ; Pareto distributions with shape parameter 1/B k B ; chi-square distributions with variance 1/B 2 B (for 1 degree of freedom, the anticoncentration "c" needs to be 1/2); etc. (Furthermore, in most cases even the condition on the parameter being in [1/B , B ] can be eliminated. For example, suppose the first coordinate has a Gaussian distribution with standard deviation . With O(log(1/ )) examples, one can with probability at least 1 - estimate to within a multiplicative factor of 2. Having done so, one can multiply all examples' first coordinate by an appropriate constant so as to get a Gaussian distribution with standard deviation in [1/2, 2]. Further, this does not change the underlying agnostic learning problem, since the class of linear threshold functions is closed under scaling a coordinate. For clarity of exposition, we leave further considerations of this sort to the Pr [f (x) = t(x)]. 
By definition, there is a function f C satisfying x Pr [f (x) = t(x)] = Opt. Since Prx [x X ] 2, it is easy to see that f |X has / error at most Opt + 2 on t under . Thus Opt Opt + 2, reader.) We now describe the effect that discretizing a real-valued distribution can have with respect to functions of linear threshold functions. It is convenient to switch from working with a distribution on X and target function X {-1, 1} to just having a distribution D on X × {-1, 1} -- see the discussion after definition of agnostic learning in Section 1.1. As usual, assume that X = X1 × · · · × Xn is a product set and that the marginal distribution of D on X is a product distribution. Suppose we have one coordinate with a real-valued distribution; without loss of generality, say X1 = R, and write D1 for the marginal distribution of D on X1 . When we refer to a "linear threshold function" on X , we assume that the "weight function" w1 : X1 R for coordinate 1 is just w1 (x1 ) = c1 x1 for some nonzero constant c1 . Lemma 5.13 Let C denote the class of functions of k linear threshold functions over X . As usual, write f C It is an easy and well-known fact that if x and y are independent random variables then Q(x + y ; ) Q(x; ); hence x1 D1 Y Pr [|x1 + Y /c1 | /2] Q(x1 ; /2). But D1 has B -polynomial anti-concentration, so Q(x1 ; /t) poly(B ) · (1) , as needed. 2 By repeating this lemma up to n times, it follows that even if all n coordinate distributions are real-valued, so long as they have poly(n)-polynomial anti-concentration we will suffer little error. Specifically (assuming k poly(n) as well), by taking = poly(/n) we get that discretization only leads to an additional error of . Finally, note that if a distribution Di is poly(n)-polynomially bounded then its discretized version is (/n, poly(n/))-bounded in the sense of Section 5.2; this lets us apply Theorem 5.10. Summarizing: Opt = inf errD (f ), Pr [f (x) = y ]. Theorem 5.14 Let = 1 × · · · × n be a product distribution on the set X = X1 × · · · × Xn . For the finite Xi 's, assume each is (/n, poly(n/))-bounded. For the real Xi 's, Consider discretizing X1 = R by mapping each x1 R to assume the associated i is poly(n)-polynomially bounded rd (x1 ), the nearest integer multiple of to xi . Write X1 = and has poly(n)-polynomial anti-concentration. Let C de Z and let D denote the distribution on X1 × X2 × · · · Xn × note the class of functions of at most k poly(n) linear {-1, 1} induced from D by the discretization.7 Write Opt 2 4 threshold functions over X . Then there is a poly(n/)k / for the quantity analogous to Opt for D . Then if D1 has B time algorithm which agnostically learns with respect to C polynomial anti-concentration, it holds that Opt Opt + (1) under . k · poly(B ) · . where errD (f ) = (x , y ) D Proof: It suffices to show that for any f C , k · poly(B ) · (1) |errD (f ) - errD (f )| P . = r [f (x) = y ] - Pr [f (x) = y ] (x , y ) D (x , y ) D 5.4 Mixtures of product distributions So far we have only considered learning under distributions D that are product distributions. In this section we show how to handle the commonly-studied case of mixtures of product distributions. The first step is to show a generic learning-theoretic reduction: Roughly speaking, if we can agnostically learn with respect to any one of a family of distributions, then we can agnostically learn with respect to a known mixture of distributions from this family -- even a mixture of polynomially many such distributions. 
(In our application the family of distributions will be the product distributions, but our reduction does not rely on this.) Although the following theorem uses relatively standard ideas, we do not know if it has appeared previously in the literature: Writing for the marginal of D on X , we can prove the above by proving x Pr [f (x) = f (rd (x1 ), x2 , . . . , xn )] k · poly(B ) · (1) . Since f is a function of some k linear threshold functions, by the union bound it suffices to show x Pr [h(x) = h(rd (x1 ), x2 , . . . , xn )] poly(B ) · (1) Theorem 5.15 Let D be a family of distributions over an instance space X . There is a generic reduction from the Pr [sgn(c1 x1 +Y ) = sgn(c1 rd (x1 )+Y )] poly(B )· (1) , problem of agnostically learning under a known mixture of x 1 D 1 Y c distributions from D to the problem of agnostically learning under a single known distribution from D. The reduction where Y is the random variable distributed according to the incurs a running time slowdown of poly(cT )/ for an addiother part of the linear threshold function h. Note that Y and tional error of , where T denotes the maximum time needed x1 are independent because is a product distribution. Now to compute D(x) for a mixture component D. since |x1 - rd (x1 )| is always at most /2, we can only have sgn(c1 x1 + Y ) = sgn(c1 rd (x1 ) + Y ) if Proof: Suppose we are agnostically learning (with respect to some class C ) under the distribution D which is a mixture of |c1 x1 + Y | |c1 | /2 |x1 + Y /c1 | /2. c distributions D1 , . . . , Dc with mixing weights p1 , . . . , pc . 7 We make the assumption that the learning algorithm knows This can lead to inconsistent labels, which is why we switched each of the mixing weights pi , each of the distributions Di , to D rather than have a target function. for any linear threshold function h. We can do this by showing and can compute any of the probabilities Di (x) in time T . We assume in the following that the Di 's are discrete distributions, but the case of absolutely continuous distributions could be treated in essentially the same way. First, we claim that the algorithm can simulate learning under any of the single distributions Di , with slowdown poly(cT )/pi . This is a standard proof based on rejection sampling: given an example x, the algorithm retains it with probability Di (x) ri (x) := pi , (1 ) D(x) a quantity the algorithm can compute in time poly(cT ). One can check that this leads to the correct distribution Di on instances. The probability of retaining an example is easy seen to be precisely 1/pi , leading to the stated slowdown. The main part of the proof now involves showing that if the algorithm agnostically learns under each Di , it can combine the hypotheses produced into an overall hypothesis which is good under D. We will deal with the issue of running time (in particular, very small pi 's) at the end of the proof. Let Opt denote the minimal error achievable among functions in C under D, and write Opti for the analogous quantity under Di , i = 1 . . . c. Since one could use the same c f C for each Di , clearly Opt i=1 pi Opti . By reduction, the algorithm produces hypotheses h1 , . . . , hc satisfying E[errDi (hi )] Opti + . We allow our overall algorithm to output a randomized hypothesis h. We will then show that E[errD (h)] Opt + . where the expectation is over the subalgorithms' production of the hi 's plus the "internal coins" of h. 
Having shown this, it follows that our algorithm could equally well produce a deterministic hypothesis, just by (randomly) fixing a setting of h's internal coins as its last step. Assume for a moment that the subalgorithms' hypotheses are fixed, h1 , . . . , hc . The randomized overall hypothesis h : X {-1, 1} is defined by taking h(x) = hi (x) with probability exactly ri (x), where the probabilities ri (x) are as defined in (1). (Note that they indeed sum to 1 and are computable in time poly(cT ).) Writing t for the target function, we compute: E [errD (h)] h's coins We now take the expectation over the production of the subhypotheses and conclude E[errD (h)] = h ic =1 pi E[errDi (hi )] = ic ic pi (Opti + ) =1 =1 pi Opti + Opt + , (2) as claimed. It remains to deal with small pi 's and analyze the running time slowdown. We modify the overall algorithm so that it only simulates and learns under Di if pi /c. Thus the simulation slowdown we incur is only poly(cT )/ , as desired. For any i with pi < /c we use an arbitrary hypothesis hi in the above analysis and assume only errDi (hi ) 1. It is easy to see that this incurs an additional error in (2) of at i most :pi < /c pi , as necessary. 2 Combining Theorem 5.15 with, say, Theorem 3.4 (for simplicity), we may conclude: Theorem 5.16 Let D be any known mixture of poly(n) product distributions over an instance space X = X1 × · · · × Xn , where we assume |Xi | poly(n) for each i. Then there is a 2 4 nO(k / ) -time algorithm for agnostically learning with respect to the class of functions of k linear threshold functions over X under D. When the mixture of product distributions is not known a priori, we can first run an algorithm for learning mixtures of product distributions from unlabeled examples. For example, Feldman, O'Donnell, and Servedio [FOS05] proved the following: Theorem 5.17 ([FOS05]) Let D be an unknown mixture of c = O(1) many product distributions over an instance space X = X1 × · · · × Xn , where we assume |Xi | O(1) for each i. There is an algorithm which, given i.i.d. examples from D and > 0, runs in time poly(n/ ) log(1/ ) and with probability at least 1 - outputs the parameters of a mixture of c product distributions D satisfying D - D 1 . (The theorem was originally stated in terms of KL-divergence but also holds with L1 -distance [FOS05].) In [FOS06] the same authors gave an analogous result for the case when each Xi = R and each product distribution is a product of Gaussians with means and variances in [1/poly(n), poly(n)]. We conclude: = = xD h's coins xD E [ Pr [h(x) = t(x)]] i ri (x) E :hi (x)=t(x) = = xD x i E i pi (x) :hi (x)=t(x) X :hi (x)=t(x) pi (x)Di (x) Di (x) = Di (x) D(x) ic = ic pi =1 x pi errDi (hi ). :hi (x)=t(x) =1 Theorem 5.18 Let D be any unknown mixture of O(1) product distributions over an instance space X = X1 × · · · × Xn , where we assume either: a) |Xi | O(1) for each i; or b) each Xi = R and each product distribution is a mixture of axis-aligned (poly(n)-bounded) Gaussians. Then there is a 2 4 nO(k / ) -time algorithm for agnostically learning with respect to the class of functions of k linear threshold functions over X under D. Proof: First use the results of [FOS05, FOS06] with = 2 4 /nO(k / ) , producing a known mixture distribution D with 2 4 D - D 1 /nO(k / ) . Then run the algorithm from Theorem 5.18. The conclusion now follows from Proposition 5.1. 
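The reduction in the proof of Theorem 5.15 has two computational steps: rejection sampling to simulate a single component D_i, and a randomized combination of the per-component hypotheses. Here is a minimal sketch of both (Python), assuming, as in the known-mixture setting, that the mixing weights p_i and the component probabilities D_i(x) are computable; the function names are ours.

    import random

    def simulate_component(examples, i, mixing_weights, densities):
        """Rejection sampling: given labeled examples (x, y) drawn from the mixture
        D = sum_j p_j D_j, retain each example with probability
        r_i(x) = p_i * D_i(x) / D(x), so that retained examples are distributed as
        i.i.d. draws from the single component D_i."""
        p = mixing_weights
        kept = []
        for (x, y) in examples:
            D_x = sum(p[j] * densities[j](x) for j in range(len(p)))
            r_i = p[i] * densities[i](x) / D_x
            if random.random() < r_i:
                kept.append((x, y))
        return kept

    def combine_hypotheses(hypotheses, mixing_weights, densities):
        """The randomized combined hypothesis h: on input x, output h_i(x) with
        probability r_i(x); note the r_i(x) sum to 1 over i."""
        p = mixing_weights
        def h(x):
            D_x = sum(p[j] * densities[j](x) for j in range(len(p)))
            r = [p[j] * densities[j](x) / D_x for j in range(len(p))]
            i = random.choices(range(len(p)), weights=r)[0]
            return hypotheses[i](x)
        return h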
6 Conclusions

In this work, we have shown how to perform agnostic learning under arbitrary product distributions and even under limited mixtures of product distributions. The main technique was showing that noise sensitivity bounds under the uniform distribution on {0,1}^n yield the same noise sensitivity bounds under arbitrary product distributions. The running time and examples required by our algorithm are virtually the same as those required for learning under the uniform distribution on {0,1}^n.

While we have established many interesting scenarios for which polynomial regression works, there is still significant room for extension. One direction is to seek out new concept classes and/or distributions for which polynomial regression achieves polynomial-time agnostic learning. Our work has dealt mostly with the case where all the attributes are mutually independent; it would be especially interesting to get learning under discrete distributions that are far removed from this assumption.

References

[BKS99] Itai Benjamini, Gil Kalai, and Oded Schramm. Noise sensitivity of Boolean functions and applications to percolation. Publ. Math. de l'IHÉS, 90(1):5-43, 1999.
[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[FJS91] Merrick Furst, Jeffrey Jackson, and Sean Smith. Improved learning of AC0 functions. In Proc. 4th Workshop on Comp. Learning Theory, pages 317-325, 1991.
[FOS05] Jonathan Feldman, Ryan O'Donnell, and Rocco Servedio. Learning mixtures of product distributions over discrete domains. In Proc. 46th IEEE Symp. on Foundations of Comp. Sci., pages 501-510, 2005.
[FOS06] Jonathan Feldman, Ryan O'Donnell, and Rocco Servedio. PAC learning mixtures of Gaussians with no separation assumption. In Proc. 19th Workshop on Comp. Learning Theory, pages 20-34, 2006.
[GR06] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symp. on Foundations of Comp. Sci., pages 543-552, 2006.
[Has01] Johan Håstad. A slight sharpening of LMN. J. of Computing and Sys. Sci., 63(3):498-508, 2001.
[Hoe48] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Stat., 19(3):293-325, 1948.
[Kal06] Adam Kalai. Machine learning theory course notes. http://www.cc.gatech.edu/atk/teaching/mlt06/lectures/mlt-06-10.pdf, 2006.
[KKMS05] Adam Kalai, Adam Klivans, Yishay Mansour, and Rocco Servedio. Agnostically learning halfspaces. In Proc. 46th IEEE Symp. on Foundations of Comp. Sci., pages 11-20, 2005.
[KOS04] Adam Klivans, Ryan O'Donnell, and Rocco Servedio. Learning intersections and thresholds of halfspaces. J. of Computing and Sys. Sci., 68(4):808-840, 2004.
[KR82] Samuel Karlin and Yosef Rinott. Applications of ANOVA type decompositions for comparisons of conditional variance statistics including jackknife estimates. Ann. Stat., 10(2):485-501, 1982.
[KSS94] Michael Kearns, Robert Schapire, and Linda Sellie. Toward efficient agnostic learning. Machine Learning, 17(2):115-141, 1994.
[LBW95] Wee Sun Lee, Peter Bartlett, and Robert Williamson. On efficient agnostic learning of linear combinations of basis functions. In Proc. 8th Workshop on Comp. Learning Theory, pages 369-376, 1995.
[LMN93] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM, 40(3):607-620, 1993.
[MOO05] Elchanan Mossel, Ryan O'Donnell, and Krzysztof Oleszkiewicz. Noise stability of functions with low influences: invariance and optimality. In Proc. 46th IEEE Symp. on Foundations of Comp. Sci., pages 21-30, 2005.
[MP69] Marvin Minsky and Seymour Papert. Perceptrons. MIT Press, 1969.
[O'D03] Ryan O'Donnell. Computational aspects of noise sensitivity. PhD thesis, MIT, 2003.
[OS03] Ryan O'Donnell and Rocco Servedio. New degree bounds for polynomial threshold functions. In Proc. 35th ACM Symp. on the Theory of Computing, pages 325-334, 2003.
[Per04] Yuval Peres. Noise stability of weighted majority. arXiv:math/0412377v1, 2004.
[Ste86] J. Michael Steele. An Efron-Stein inequality for nonsymmetric statistics. Ann. Stat., 14(2):753-758, 1986.
[Val84] Leslie Valiant. A theory of the learnable. Comm. of the ACM, 27(11):1134-1142, 1984.
[vM47] Richard von Mises. On the asymptotic distribution of differentiable statistical functions. Ann. Math. Stat., 18(3):309-348, 1947.