On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning Percy Liang pliang@cs.berkeley.edu Computer Science Division, University of California, Berkeley, CA 94720, USA Nati Srebro Toyota Technological Institute at Chicago, Chicago, IL 60637, USA nati@ttic.edu Abstract A learning problem might have several measures of complexity (e.g., norm and dimensionality) that affect the generalization error. What is the interaction between these complexities? Dimension-free learning theory bounds and parametric asymptotic analyses each provide a partial picture of the full learning curve. In this paper, we use high-dimensional asymptotics on two classical problems--mean estimation and linear regression--to explore the learning curve more completely. We show that these curves exhibit multiple regimes, where in each regime, the excess risk is controlled by a subset of the problem complexities. What we really want to understand is the true behavior of the excess risk En (B, d) as a function of the sample size n, norm B, and dimensionality d. In this paper, we analyze the excess risk for two classical problems--mean estimation (Section 2) and linear regression (Section 3)--by performing a highdimensional asymptotic analysis. In particular, we allow the complexity (B, d) of the problem to grow with the sample size n, so that the excess risk En (B, d) ~ ~ converges to a non-vanishing asymptotic limit E(B, d), ~ ~ where B and d are the rescaled complexities. We then ~ ~ study this limiting function as B and d vary to see the interaction between norm and dimensionality. We show how the excess risk can have multiple regimes, where in each regime, the excess risk is controlled by a subset of the relevant complexities. Furthermore, we find that the transitions between regimes are smooth, even asymptotically. Notation For a vector v Rd , we write v = vv . Let Xn = Op (1) denote that the sequence of random variables (Xn )n1 is bounded in probability, that is, for every > 0, there exists M < such that supn P (Xn > M ) . We write Xn = Op (Yn ) n to mean Xn = Op (1). Let Xn - 0 denote con Y vergence in probability, that is, for every > 0, limn P (|Xn - X| > ) = 0. When we use big-O notation, only universal constants are hidden, never parameters of the learning problem. 1. Introduction Most analyses of learning algorithms proceed by identifying a measure of complexity of the learning problem and then provide either bounds or asymptotic expressions for the generalization error (risk) in terms of that complexity. For instance, for linear models, bounds based on Rademacher complexity (Bartlett & Mendelson, 2001), covering numbers (Pollard, 1984), or online learning (Cesa-Bianchi & Lugosi, 2006) depend on the norm (in relation to the variance of data and the noise) and not the dimensionality. On the other hand, classical parametric asymptotic analyses (van der Vaart, 1998; Liang et al., 2010) provide answers that depend only on the dimensionality and not the norm. There seems to be some tension here: If the sample complexity depends asymptotically only on the dimensionality, how can it be bounded in terms of only the norm? Appearing in Proceedings of the 27 th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s). P 2. Constrained Mean Estimation A classical problem in statistics is estimating the mean of a multivariate Gaussian from i.i.d. samples. We consider a variant of this problem where the norm of mean vector is constrained to a Euclidean ball. Even in this simple problem, we will see that two learning regimes emerge: a random regime controlled by the norm and a unregularized regime controlled by the dimensionality. On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning 2.1. Setup The mean estimation problem is defined as follows: Let µ Rd be the unknown mean vector that we wish to estimate, and let B = µ 2 denote its norm. We obtain n i.i.d. training points: X (i) N (µ , 2 Id×d ) for i = 1, . . . , n. Let the tuple = (B, d, 2 ) specifies an instance of the mean estimation problem, which includes the norm B, dimensionality d, and data variance 2 . Given a vector µ Rd , we measure its generalization error (risk) using squared loss: (µ) = EXN (µ ,2 I) (X - µ)2 . def Definition 1 (Limiting Problem Specification). A se2 2 quence of mean estimation problems n = (Bn , dn , n ) ~ = (B 2 , d, 2 ) if ~ ~~ has a limit 2 ~ Bn B 2 , 2 dn n ~ d, n 2 n 2 ~ (5) as n . ~ Intuitively, the limiting problem specification captures the essence of the mean estimation problem. The following proposition gives a precise handle of the excess risk in this limit: Proposition 1. Suppose a sequence of mean estima2 ~ tion problems n = (Bn , dn , n ) has a limit = ~~ ~ d, 2 ). Then the excess risk (3) has the following (B, asymptotic limit: def P En = En (n ) - E(), ~ (1) Define the following estimator which minimizes the empirical risk subject to a norm constraint: n (6) µn = argmin ^ µ 2 =B def (X i=1 (i) - µ) . 2 (2) Our goal is to study the excess risk of µn , defined as ^ follows: def En () = (^n ) - (µ ). µ (3) 2.2. Preliminary Analysis We first derive a closed form solution for the estimator µn . Consider the Lagrangian of the constrained opti^ n mization problem in (2): L(µ, ) = i=1 (X (i) - µ)2 + 2 µ . Differentiating L with respect to µ and setting ¯ n 1 X ¯ it to zero, we get µ = 1+ , where X = n i=1 X (i) is the empirical mean. To satisfy the constraint µ 2 = ¯ B, we must have µn = B X2 . The estimator µn is ^ ^ ¯ X ¯ just the unconstrained estimator X projected onto the radius-B sphere. Next, we decompose the risk in (1) into two orthogonal parts: (µ) = E[(X-µ )2 ]+(µ-µ )2 = d 2 +(µ-µ )2 . Plugging the derived expressions for µn and (µ) into ^ (3) yields the following expression for the excess risk: En () = ¯ BX ¯ 2 -µ X 2 where the asymptotic excess risk is 1 ~ ~ E() = 4B 2 sin2 arctan 2 ~ d . ~ B2 (7) ~ Note that E() is a non-random function; this is because in high dimensions, the excess risk concentrates. Before proving the proposition, let us establish some ~ intuitions about the regimes that E() exhibit by vary~ ing : ~ ~ · Random Regime (B 2 d): When the rescaled di~ mensionality d is large, the arctan term tends to ; 2 1 also, sin2 ( ) = 2 , so the asymptotic excess risk is 4 ~ ~ E 2B 2 . In this regime, the norm B dominates the ~ excess risk and the dimensionality d is irrelevant. Geometrically, the estimator µn essentially pro^ duces a random point on a (dn - 1)-dimensional sphere, whose squared distance from µ concenn 2 trates around 2Bn . ~ ~ · Unregularized Regime (d B 2 ): When the rescaled dimensionality is small, then 1 4 sin2 ( 2 arctan(x)) x2 , so the excess risk is ~ Here, the dimensionality d dominates the ~ E d. ~ is irrelevant. excess risk and the norm B This regime is very closely related to parametric asymptotics. The maximum likelihood estimator 2 n ¯ Xn has excess risk exactly n · 2n , where 2n ded d notes a 2 random variable with dn degrees of free2 dom, which has mean dn . When Bn is large, the ¯ sphere looks locally flat, so the projection of Xn onto its surface simply removes an insignificant degree of freedom. . (4) 2 ¯ Note that X is distributed as N (µ , ). n 2.3. Asymptotic Analysis To analyze the excess risk En (), we turn to asymptotics to simplify the form of En (). In particular, we 2 consider a sequence of problems n = (Bn , dn , n ) so that the excess risk En (n ) converges to some non~ vanishing quantity E() as n . The allowed sequences n are given in the following definition: On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning Vn - P ~ d 1.28 excess risk 0.64 0.32 0.16 0.08 0.04 actual (En ) asymptotic (E) upper bound (E B ) P ~ Un - B { µ n En µn ^ ¯ Xn n { 101 102 103 104 Figure 2. Geometric depiction of the excess risk En (used in the proof of Proposition 1). The constrained estimator ¯ µn is obtained by projecting Xn down to the radius-Bn ^ sphere. The key is that in high dimensions, the two random components Un and Vn both concentrate. n Figure 1. On constrained mean estimation: Log-log plot 2 2 of the excess risk for dn = 1000, Bn = 1, n = 1. The 2 ~ corresponding limiting values are B 2 = Bn = 1 and 2 d n n ~ d = n = 1000 . One can clearly see the two regimes n marked by their slopes (0 and -1, respectively): In the ran~ dom regime, the norm B controls the excess risk, while in ~ the unregularized regime, the dimensionality d does. The ~ ~ bound E B = min{2B 2 , d} represents the ends of E quite accurately, but misses the smooth transition in the middle. We can obtain En using basic trigonometry. First, n compute the angle n = arctan Un . Bisecting n V and converting angles back to lengths yields En = 2Bn sin( 1 n ). Putting everything together, we have 2 2 En = 4Bn sin2 Based on the previous discussion, we can actually stitch together the following upper bound, ~ def ~ ~ ~ E () = min 2B 2 , d E(), B 1 arctan 2 Un Vn . (9) (8) Now we compute the limits of Un and Vn . First, Un includes the small deviation along the first component: 2 P ~ n ) - B. Vn includes the deviations Un = Bn +Op ( n which clearly marks out the two regimes. We can see how well the asymptotics represent the actual excess risk En for finite n by plotting the learning curve (Figure 1), increasing n while keeping n fixed. Note that increasing n decreases the rescaled ~ dimensionality d. From Figure 1, we see that while B the bound E matches the asymptotic E at the ends, there is a noticeable gap when transitioning between the two regimes. In particular, the bound is piecewise linear whereas asymptotic curve is smooth, tracking the empirical excess risk much more closely. We note that the transition is smooth even in the asymptotic limit; this is due to the smoothness of the loss function. Proof of Proposition 1. Without loss of generality, assume µ = (Bn , 0, . . . , 0) because the problem is ron tationally invariant. First, let us decompose the pre-projected estimator ¯ Xn into two components: (1) the component in the def ¯ subspace of µ , which has length Un = |Xn1 |, and n (2) the component orthogonal to µn , which has length def dn ¯ 2 X . Figure 2 depicts the setup: the Vn = j=2 nj along the other d - 1 components, which amounts to 2 P n 2 ~ Vn = - d. Since En depends on Un and n dn -1 Vn only via smooth trigonometric functions (9), we can apply the continuous mapping theorem to obtain P En - E as desired. 3. Regularized Linear Regression We now turn to norm-regularized linear regression. We first analyze a componentwise estimator (which treats each parameter separately), showing that even in this simple case, the asymptotic excess risk exhibits three regimes, not two as in mean estimation. For the full least squares estimator, we use a combination of upper bounds to hypothesize the existence of four regimes. 3.1. Setup We assume that data are generated as follows: X N (0, d×d ) and Y = X, + W , where Rd is the true parameter and W N (0, 2 ) is independent noise. Let p (X, Y ) denote the resulting distribution. We consider a linear regression problem to be fully specified by the tuple = (, , 2 ). Given a predictor Rd , we are interested in its excess risk En we want to compute is denoted geometrically. On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning generalization error (risk), which averages the squared loss over test points: () = E(X,Y )p [(Y - X, )2 ]. def (10) Given n i.i.d. training points {(X (i) , Y (i) )}n drawn i=1 from p , define the regularized least-squares estimator: 1 ^ def n = argmin Rd n n form. First, conditioned on X (1) , . . . , X (n) , we can ^ integrate out the W (1) , . . . , W (n) in S by indepen^ | X (1) , . . . , X (n) ] = 2 . ^ dence; this yields E[S n ^ Next, the inverse covariance matrix -1 has an inverse 1 ^ Wishart distribution (-1 W -1 ( n , n)), which has -1 n mean n-d-1 . Putting everything together, we obtain 0 E[En ()] = d 2 n-d-1 . Y (i) - X (i) , i=1 2 + 2 , (11) where 0 is the regularization parameter. We are interested in analyzing the excess risk of this estimator: ^ En () = (n ) - ( ). def (12) It is interesting to note that the excess risk does not depend on and , but only on the dimensionality d. The norm of does not play a role at all because the unregularized estimator is shift-invariant. The covariance of the data does not play a role due to the following intuition: the larger is, the easier it is to estimate , but the harder it is to predict. The two forces cancel out exactly. 3.3. Componentwise Estimator Asymptotics In this section, we introduce and analyze a simple estimator that still provides additional insight into the interaction between norm and dimensionality. Define the componentwise least-squares estimator, which esti^ mates each component of Rd separately, as follows: ^ j = where j = jj . ^2 ^ The componentwise estimator consistently estimates regardless of whether is diagonal. When is diagonal, the excess risk is just the sum across the components, where component involves a one-dimensional regression problem. Without regularization, the expected excess risk of the componentwise estimator is d 2 n-2 . Note this is smaller than the excess risk of the d full unregularized estimator, which is n-d-1 . We effectively gain an effective sample of size d-1 by exploiting knowledge of the eigenstructure of . 2 We also consider the excess risk of an oracle estimator which chooses the optimal (in a data-dependent way) to minimize the excess risk: En () = inf En (). 0 def (13) Assumption 1 (Diagonal Covariance). Assume that the covariance matrix of the data is diagonal, that is, 2 2 = diag(1 , . . . , d ). This assumption is made without loss of generality for the estimators we have defined so far, which are rotationally invariant. This is not true for the componentwise estimator that we will introduce later. 3.2. Preliminary Analysis ^ def 1 n Define the empirical covariance = n i=1 X (i) ^ def 1 n and let S = n i=1 X (i) W (i) . First, we solve (11) in the standard way by differentiating and setting the ^ ^ ^ ^ result to zero to get n = ( + I)-1 ( + S). Next, since the noise W is independent of X, we can 2 rewrite (10) as () = + tr{( - ) }. Apn plying these two derived expressions to (12), we get ^ En () = tr{(n - ) }. Using some algebra, we can write the parameter error ^ ^ ^ ^ as n - = -( + I)-1 + ( + I)-1 S. Putting the last two equations together yields: ^ ^ ^ En () = tr{[(+I)-1 -(+I)-1 S] }. (14) j j + Sj ^2 ^ j + ^2 j = 1, . . . , d, (15) Unregularized Estimator Before we consider regularization, let us comment on the unregularized estimator (when = 0). In this case, the excess risk 0 0 ^ ^ ^ En () in (14) simplifies to En () = tr{-1 S -1 }. 0 We can compute the expectation of En () in closed In this section, we analyze the excess risk of the componentwise estimator using asymptotics. The lack of covariance structure simplifies the math considerably. Consider a sequence of regression problems 2 2 n = (n , n , n ). We do not yet commit to a particular scaling, but we do impose the following constraints: Assumption 2 (Constraints on Limiting Problem d 2 Specification). Assume that lim supn nn n < and dn lim supn j=1 |nj |nj < . We now derive the asymptotic excess risk of the oracle componentwise estimator: Proposition 2. Consider a sequence of regression 2 2 problems n = (n , n , n ) satisfying Assumption 2. On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning Define dn En = def j=1 2 2 2 4 2 nj nj n nj + . 2 2 (nj + )2 n(nj + )2 squared bias variance the randomness in En . Using these variables, define: dn 2 2 3 2 nj nj - 2nj n nj Rj1 + 2 (nj (1 + Rj2 ) + )2 2 n 4 n nj (16) Fn (R) = j=1 , (19) dn Suppose that there exists a function E such that def sup0 |En - E | 0. Then the excess risk En = En (n ) of the oracle componentwise estimator (13) has the following asymptotic limit E : En = inf En - inf E = E . 0 0 def P def G (R) = n j=1 2 n 4 ^ n nj Hnj 2 (nj (1 + Rj2 ) + )2 . (20) (17) We have constructed Fn and G so that En = Fn (0) n ^ ^ and En = Fn (Rn ) + Gn (Rn ), which can be veri fied with some algebra. Intuitively, Fn (0) captures the non-random problem-dependent part of the excess ^ ^ risk; Rn and G (Rn ) contribute the random problemn independent part. For a fixed , we will show that the excess risk En converges to a non-random asymptotic excess risk E , using En as an intermediate quantity that intuitively removes the randomness from En . The concentration of En around En must be established. ^ Let An be the event that Rn 1 . On event An , 2 ^ Lemma 1 below will show that Fn (Rn ) 1 M for some constant M independent of n and . Note: norms on the matrices are element-wise. 1 Lemma 1. For all R Rdn ×2 such that R 2 and 0, we have Fn (R) 1 M , where M is a constant independent of n and . def What is new in regression is the minimization over . P To establish inf 0 En - inf 0 E (i.e., switching inf with lim), we need some sort of uniformity over , which occupies most of the following proof. 2 Proof. Let Qnj = nj (1 + Rj2 ) + . For each j, we have Proof of Proposition 2. To prove that En - E , it P suffices to show that sup0 |En -En | - 0. If we have that, the proposition can be established as follows: Let n argmin0 En and argmin0 E . Note n that En = En and E = E . For any > 0, we have n En En En E E n 2 En n 2 En n 2 2 for sufficiently large n with high probability, where a b denote |a - b| . This ensures |E - En | . Now, we need to show that sup0 |En - En | - 0, i.e., that the residual |En - En | goes to zero at a rate that does not depend on . Specializing (14) to the componentwise estimator and expanding yields: P P Fn (R) Rj2 2 2 2 3 = -2Q-3 nj (2 nj nj -2nj n nj Rj1 + nj 2 n 4 n nj ). 1 2 Using the fact that Qnj is larger than 2 nj and Fn (R) Rj2 | , we obtain | 2 2 4nj nj + 8nj n nj + 16 F (R) 2 n n . n Similarly, we can bound | Rj1 | 4nj n nj . Sum the right-hand sides over j = 1, . . . , dn . By Assumption 2, this sum is bounded above by a quantity independent of n and . ^ By the mean value theorem, Fn (Rn ) - Fn (0) = ^ ^ Fn (cRn ) Rn for some c [0, 1], where, abusing no^ tation, Rn is treated as a vector. Applying the lemma with H¨lder's inequality and taking a sup over yields o ^ ^ sup0 |Fn (R) - Fn (0)| M Rn . Note this is all still conditioned on An . dn En = j=1 2 2 ^ ^2 nj (2 nj - 2nj Snj + Snj ) . (^nj + )2 2 (18) ^ Now we want to bound sup0 |G (Rn )|. It sufn fices to take = 0, which is where G attains n ^ its maximum value. Simplifying, we get G0 (Rn ) = n 2 n n ^ Hnj dn ^ j=1 (1+Rnj2 )2 . Note that each summand con- def ^ For each j = 1, . . . , dn , let Rnj1 = def ^ Rnj2 = 2 nj -nj ^2 2 nj = Op (n- 2 ), and 1 ^ 1 Snj -2 ), n nj = Op (n ^2 def nSnj ^ Hnj = 2 2 - 1 = n nj verges in distribution to a 2 distribution minus 1 (which has mean zero), independent of n and . To finish the proof, fix > 0. With high probability, we can take n large enough (in a way that does not ^ depend on n or ) such that (1) R < 2M and event An holds by applying a standard tail bound plus Op (1). Importantly, these variables (1) do not depend on the problem specification n and (2) capture all On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning ^ a union bound over the 2dn = O(n) elements of Rn ; and (2) | d1 n numbers. ^ Hnj dn ^ j=1 (1+Rnj2 )2 | 7.8e-1 1.6e-1 ~ 2d by the law of large excess risk ^ This ensures that both sup0 |Fn (R) - Fn (0)| 2 ^ and sup0 |Gn (R)| 2 , implying that sup0 |En - En | . 3.1e-2 6.3e-3 1.3e-3 2.5e-4 actual (En ) asymptotic (E ) lower bound ( 1 E B ) 4 upper bound (E B ) 3.4. Learning Regimes To get some concrete intuition for Proposition 2, let us specialize the problem specification: Assumption 3 (Two-Part Regression Structure). Let the true parameter vector be n = (1, 0, . . . , 0) Rdn and the data covari2 2 Cn Cn 2 ance be = diag(Bn , dn -1 , . . . , dn -1 ) Rdn ×dn for 2 2 some Bn , Cn > 0. 2 The idea is that Bn captures the squared norm of the signal in the data (which exists only on the first com2 ponent), and Cn captures the squared norm of irrele vant components. The norm of n can always taken to be one without loss of generality, since the true measure of complexity is the product of the norm of the predictor with the norm of the data. 102 104 106 n Figure 3. Componentwise estimator for linear regression: Log-log plot of the learning curve for d = 100, B 2 = 1, C 2 = 10, 2 = 100. There are three regimes, each characterized by a different slope (0 corresponding to a constant excess risk in the random regime, - 1 corresponding to a rate of 2 1 1 in the regularized regime, and -1 corresponding to n n in the unregularized regime). Proposition 2 is the limit of the excess risk En of the oracle estimator. The following proposition sheds light into the multi-regime structure of E : Proposition 3 (Bounds on Regimes). Let Definition 2 (Limiting Problem Specification). A sequence of linear regression prob2 2 2 lems n = (Bn , Cn , dn , n ) converges to a limit ~ = (B 2 , C 2 , d, 2 ) if ~ ~ ~~ 2 2 2 2 ~ ^ ~ C ~ dn n d, 2 2 (21) Bn B 2 , n n C 2 , ~ n n n ~2 def ~ ~ 2C , d . E B = min B 2 , ~ ~ B d (23) The asymptotic excess risk E of the oracle componentwise estimator defined in (15) is bounded by E B to within a factor of four: 1 B E E E B. 4 (24) as n . Note that we allow both the dimensionality dn and the 2 squared norm Cn of the irrelevant components tend to 2 . The presence of Cn will create a new intermediate learning regime. Specializing the asymptotic excess risk from (16) to this problem specification: 2 dn n 2 2 4 n (dn -1)2 Bn 2 n Bn En = + + , 2 2 2 (Bn + )2 n(Bn + )2 j=2 n( Cn + )2 dn -1 C4 We can plot the learning curve as a relationship between the sample size and excess risk, for a fixed specification of the regression problem. Figure 3 shows the actual excess risk En , the asymptotic excess risk E , 1 ~ ~ and bounds 4 E B , E B . Note that C 2 and d scale inversely with n, so that the three regimes scale as 1, 1 , and 1 , respectively. n n The bound (23) indicates three regimes corresponding to each of the three terms: · Random Regime: In this regime, should be large so that the variance V 0 and the squared bias ~ U B 2 (see (22)). This corresponds to simply ^ guessing n = 0. Only the squared norm of the ~ 2 controls the excess risk. signal B · Regularized Regime: In this new regime, must be optimized to balance the squared bias U and variance V terms. The squared norm of the irrelevant which can be shown to converge uniformly across 0 to ~ C4 ~ B 2 2 ~ E = + ~2 d . (22) C ~ 2 + )2 (B ( ~ + )2 d squared bias = U def variance = V def Recall that we are ultimately interested in the excess risk of the oracle estimator E = inf 0 E , which by On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning 7.8e-1 1.6e-1 3.1e-2 7.8e-1 1.6e-1 3.1e-2 1 Now we show the lower bound 4 E B E . Using the 1 1 fact that a b 0 implies (a+b)2 4a2 , we have the following relations: 2 Cn = 10 2 Cn = 30 2 Cn = 100 E 6.3e-3 1.3e-3 2.5e-4 10 2 E dn = 30 dn = 100 dn = 300 6.3e-3 1.3e-3 2.5e-4 ~ ~ (i) > B 2 implies U 1 B 2 , 4 ~ (ii) B 2 implies U 1 2 , 4 10 4 10 6 10 2 10 4 10 6 (iii) > (iv) n (a) Varying dimensionality dn n 2 (b) Varying norm Cn ~ C2 ~ d ~ C2 ~ d implies V implies V ~ 1 d 4 2 ~ C4 , and 1 ~ 4 d. Figure 4. Log-log plots of asymptotic excess risk E (same default parameters as in Figure 3), where we study the im2 pact of varying the norm Cn and dimensionality dn . Varying the dimensionality dn affects both the regularized and 2 unregularized regimes (a); varying the squared norm Cn only affects the regularized regime (b). Take any 0. The plan is to construct L out of two lower bounds on U and V , respectively, based on which of the above relations are satisfied. In doing so, we ensure that L E . We will also show that 1 B min 0 L L . Since this holds for all 4E 1 0, we will have that 4 E B inf 0 E = E . ~ Now we consider the four cases for : If > B 2 (i) and ~ 1 ~2 C2 > d (iii), we have inf L = 4 B with . If ~ ~ ~ ~ ~ 2 (i) and C 2 (iv), inf L = 1 B 2 + 1 d. If >B ~ B 2 (ii) and > 2 ~ components C 2 dominates the excess risk, favorably ~ ~ scaled down by large B and d. The implied de1 pendence of the excess risk on n is n . · Unregularized Regime: should be small so that ~ ~ U 0 and V d. Here, the dimensionality d controls the excess risk, yielding an excess risk of 1 order n , independent of the norm. Figure 4 shows the impact of varying the problem parameters on the various regimes. As one would expect from (23), changing the norm C 2 only affects the intermediate regime, whereas changing the dimensionality d affects both the variance and the intermediate regime. Proof of Proposition 3. We first prove the upper bound E E B . To do this, we need to show that E (defined in terms of (22)) is upper bounded by each of the three terms in (23). ~ To get E B 2 , take (infinite regularization). In this limit, the squared bias term U dominates and ~ converges to B 2 . To show E 2 ~ d ~ C2 ~ d (iii), inf L = 4 4 ~ 1 2C 2 , with 4 ~ ~ B d ~~ = B C , based on an earlier derivation. Finally, if ~ d ~ B 2 (ii) and = 0. ~ C2 ~ d (iv), inf L = 1 ~ 4d with The regularized regime does not always exist. For ex~ ample if the dimensionality is relatively small (d ~2 2C B ~ ), then based on E in (23), the excess risk of the B2 ~2 regularized regime ( 2C ) will be larger than the geo~ B ~ d metric average of the excess risks of the other regimes ~ ~ (B d). In this case, we jump directly from the random regime to the unregularized regime. 3.5. Full Estimator So far, we have analyzed the componentwise estimator. This section offers a partial characterization of the learning curve for the full estimator (11). ~ Clearly, E B 2 by taking (the random ~ regime), and E d by taking = 0 (the unregularized regime). To analyze the intermediate regime, define the constrained estimator to be one which chooses ^ so that n 1, and n denote this . Let E C be the corresponding asymptotic excess risk and note that the oracle asymptotic excess risk E E C . Having not yet been able to derive an exact asymptotic form for E C , we instead offer some speculations based on upper bounds for stochastic optimization (online ^sgd learning). Let n be the estimator obtained by running one pass of stochastic gradient descent over the ~ 2C 2 , ~ d B ~ observe that E = U +V , where ~ C4 d U B 2 and V ~ (22). Optimize the sum of the 2 ~ two bounds with respect to : E inf 0 2 ~ + d2 ~ B2 ~ C4 = ~ 2C 2 , ~ ~ B d 2 (25) ~~ which is attained by setting 2 = B C . ~ d ~ Finally, to show that E d, simply set = 0 (corre~ sponding to no regularization), to get that E = d. On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning training data. Then in expectation over the sample we have sgd En 2 2 2 Bn + C n +2 n 2 2 2 2 (Bn + Cn )n . n This follows from Srebro et al. (2010), which is based on Theorem 1 of Shalev-Shwartz (2007). While this bound holds only for stochastic gradient descent, we strongly suspect that it also holds for the constrained estimator (the regularized empirical risk minimizer). Putting together all the pieces yields the following coarse approximate form to the risk: EC ~ min B 2 , O ~ C2 ~ +C 2 ~ ~ ,d . (26) Both analyses provide complementary but incomplete views of the learning curve. In this paper, we used high-dimensional asymptotics to obtain an asymptotically exact analysis also when En is away from zero, albeit for simple problems. The key is that as n grows with n, the appropriate ratios between sample size and complexity are maintained, while still allowing us reap the benefits of asymptotics, namely concentration. Such ideas have been around in statistics since Kolmogorov's work in the 1960s, and more recently have played an important role in high-dimensional sparse settings (e.g., Wainwright (2009)). Related ideas can also be found in statistical physics approaches for studying learning curves (Haussler et al., 1994). The particular focus in this paper has been on understanding how multiple problem complexities interact to generate multiple regimes in learning curves. We have so far characterized the regimes for two problems--mean estimation and componentwise linear regression­as a starting point. We hope future work will help shed light on learning curves in more general settings. We emphasize that (26) is purely speculative. Nevertheless, it can help us understand what might change from the componentwise analysis. Comparing (26) ~ with (23), we see an additional factor of BC d that ~ might correspond to the benefit of specializing to a diagonal covariance. We also notice the additional ad~2 1 ditive term C2 , which behaves as n and is the relevant ~ term when 2 is large. This additional additive term ~ gives rise to a fourth regime. While the componentwise estimator has three regimes, (26) suggests the full estimator has four regimes, with two regimes controlled only by the norm, independent of the dimensionality. Further work is necessary to confirm these hypotheses. References Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. In Computational Learning Theory, pp. 224­240, 2001. Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, Cambridge, UK, 2006. Haussler, D., Kearns, M., Seung, H. S., and Tishby, N. Rigorous learning curve bounds from statistical mechanics. In Computational Learning Theory, pp. 76­87, 1994. Liang, P., Bach, F., Bouchard, G., and Jordan, M. I. Asymptotically optimal regularization in smooth parametric models. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2010. MIT Press. Pollard, D. Convergence of Stochastic Processes. SpringerVerlag, 1984. Shalev-Shwartz, S. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007. Srebro, N., Sridharan, K., and Tewari, A. Stochastic optimization and online learning with smooth loss functions. Technical report, TTI Chicago, 2010. van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998. Wainwright, M. J. Sharp thresholds for noisy and highdimensional recovery of sparsity using 1 -constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183­2202, 2009. 4. Discussion Our broad goal is to obtain an accurate picture of the learning curve. There are a plethora of approaches in the literature that tackle pieces of the curve. Classical parametric asymptotics, a dominant approach in statistics, let the sample size n while fixing the problem specification . Hence, they consider the limit of the learning curve where the excess risk P En - 0. These analyses thus focus on the local fluc tuations of estimators around a limiting value. As a result, norm constraints do not enter into the asymptotic risk, even with considering higher-order asymptotics (e.g., Liang et al. (2010)). On the other hand, finite sample complexity bounds (e.g., Bartlett & Mendelson (2001)) provide statements for any sample size n and problem specification n . These focus on controlling structural complexities. Thus, they are well suited for handling norm constraints and typically yield dimension-free results. However, these are only upper bounds and can be far from being tight.