Efficient Euclidean Projections in Linear Time

Jun Liu  J.LIU@ASU.EDU
Jieping Ye  JIEPING.YE@ASU.EDU
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA

Abstract

We consider the problem of computing the Euclidean projection of a vector of length $n$ onto a closed convex set, including the $\ell_1$ ball and the specialized polyhedra employed in (Shalev-Shwartz & Singer, 2006). These problems have played building-block roles in solving several $\ell_1$-norm based sparse learning problems. Existing methods have a worst-case time complexity of $O(n \log n)$. In this paper, we propose to cast both Euclidean projections as root finding problems associated with specific auxiliary functions, which can be solved in linear time via bisection. We further make use of the special structure of the auxiliary functions and propose an improved bisection algorithm. Empirical studies demonstrate that the proposed algorithms are much more efficient than the competing ones for computing the projections.

1. Introduction

The Euclidean projection of a vector $\mathbf{v} \in \mathbb{R}^n$ onto a set $G \subseteq \mathbb{R}^n$ is defined as

    $\pi_G(\mathbf{v}) = \arg\min_{\mathbf{x} \in G} \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|^2,$    (1)

where $\|\cdot\|$ is the Euclidean ($\ell_2$) norm. Since the objective function in (1) is strictly convex, its solution is unique for a closed and convex set $G$. When the set $G$ is simple, e.g., a hyperplane, a halfspace, or a rectangle, the problem in (1) has an analytical solution (Boyd & Vandenberghe, 2004). However, for a general closed and convex set $G$, the problem in (1) does not admit an analytical solution; for example, when $G$ is a general polyhedron, (1) leads to a Quadratic Programming problem.

In this paper, we address the problem of computing the Euclidean projection onto the following two closed and convex sets (see Figure 1 for an illustration of $G_2$):

    $G_1 = \{\mathbf{x} \in \mathbb{R}^n \mid \|\mathbf{x}\|_1 \le z\},$    (2)
    $G_2 = \{\mathbf{x} = (\bar{\mathbf{x}}, \tilde{\mathbf{x}}) \in \mathbb{R}^n,\ \bar{\mathbf{x}} \in \mathbb{R}^p,\ \tilde{\mathbf{x}} \in \mathbb{R}^{n-p} \mid \bar{\mathbf{x}} \ge 0,\ \tilde{\mathbf{x}} \ge 0,\ \|\bar{\mathbf{x}}\|_1 = \|\tilde{\mathbf{x}}\|_1 \le C\},$    (3)

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, and $z > 0$ and $C > 0$ denote the radii of the $\ell_1$ balls.

[Figure 1. Illustration of the set $G_2$ in the three-dimensional case ($p = 2$, $n = 3$). $G_2$ is the region bounded by the following three lines: 1) $\bar{x}_1 = 0$, $\bar{x}_2 = \tilde{x}$; 2) $\bar{x}_2 = 0$, $\bar{x}_1 = \tilde{x}$; and 3) $\tilde{x} = C$, $\bar{x}_1 + \bar{x}_2 = C$.]

These two Euclidean projections have played building-block roles in solving several $\ell_1$-norm based sparse learning problems (Tibshirani, 1996; Koh et al., 2007; Ng, 2004; Duchi et al., 2008; Shalev-Shwartz & Singer, 2006; Shalev-Shwartz, 2007). The Euclidean projection onto the $\ell_1$ ball ($G_1$) can be applied to solve the $\ell_1$ ball constrained learning problem

    $\min_{\mathbf{x}: \|\mathbf{x}\|_1 \le z} \mathrm{loss}(\mathbf{x}),$    (4)

where $\mathrm{loss}(\cdot)$ is a given convex loss function. For example, setting $\mathrm{loss}(\cdot)$ to the least squares loss leads to the well-known Lasso problem (Tibshirani, 1996), and setting $\mathrm{loss}(\cdot)$ to the empirical logistic loss leads to the $\ell_1$ ball constrained logistic regression problem (Koh et al., 2007). The use of the $\ell_1$ ball constraint (or, equivalently, $\ell_1$-norm regularization) results in sparse solutions and empirical success in various applications (Candès & Wakin, 2008; Donoho, 2006; Ng, 2004; Koh et al., 2007; Shalev-Shwartz & Srebro, 2008; Tibshirani, 1996).
To solve (4) in the large-scale scenario, one may rely on first-order methods, i.e., those using only function values and (sub)gradients at each iteration. Well-known first-order methods include subgradient descent, gradient descent, and Nesterov's optimal method (Nesterov, 2003; Nemirovski, 1994). When applied to solve (4), one key building block is the Euclidean projection onto the $\ell_1$ ball. Duchi et al. (2008) proposed two algorithms for this projection. The first algorithm is motivated by the work of (Shalev-Shwartz & Singer, 2006; Shalev-Shwartz, 2007): it works by sorting the elements of the vector and then obtaining the projection by thresholding, resulting in a time complexity of $O(n \log n)$. The second algorithm is based on a modification of the randomized median finding algorithm (Cormen et al., 2001), and it has an expected (not worst-case) time complexity of $O(n)$.

The Euclidean projection onto the specialized polyhedron $G_2$ was studied in (Shalev-Shwartz & Singer, 2006) in the context of learning to rank labels from a feedback graph. Shalev-Shwartz and Singer (2006) reformulated their proposed model as a Quadratic Programming problem subject to a set of affine constraints, in which the projection onto $G_2$ is a key building block. To solve this projection, they proposed to first sort the elements of the vectors $\bar{\mathbf{v}}$ and $\tilde{\mathbf{v}}$, then solve a piecewise quadratic minimization problem, and finally obtain the solution by thresholding. The resulting algorithm has a time complexity of $O(n \log n)$.

In this paper, we propose to cast both Euclidean projections as root finding problems associated with specific auxiliary functions. Based on these reformulations, we propose to solve both problems using bisection, which has a worst-case linear time complexity. We further make use of the special structure of the auxiliary functions and propose an improved bisection algorithm. Empirical studies demonstrate the efficiency of the proposed algorithms in comparison with existing ones.

Notation: Vectors are denoted by lower-case bold face letters, e.g., $\mathbf{x} \in \mathbb{R}^n$ is an $n$-dimensional vector. The $i$-th element of $\mathbf{x}$ is denoted by $x_i$. $\|\cdot\|$ denotes the Euclidean ($\ell_2$) norm, and $\|\cdot\|_1$ denotes the $\ell_1$ norm.

Organization: We cast both Euclidean projections as root finding problems in Section 2, propose efficient projection algorithms in Section 3, report empirical results in Section 4, and conclude this paper in Section 5.

2. Reformulation as Root Finding Problems

In this section, we reformulate both Euclidean projections as root finding problems using the Lagrangian technique.

2.1. Projection onto the ℓ1 Ball

The problem of Euclidean projection onto the $\ell_1$ ball $G_1$ can be formally defined as

    $\pi_{G_1}(\mathbf{v}) = \arg\min_{\mathbf{x}: \|\mathbf{x}\|_1 \le z} \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|^2.$    (5)

Introducing the Lagrangian variable $\lambda$ for the constraint $\|\mathbf{x}\|_1 \le z$, we can write the Lagrangian of (5) as

    $L(\mathbf{x}, \lambda) = \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|^2 + \lambda(\|\mathbf{x}\|_1 - z).$

Let $\mathbf{x}^*$ be the primal optimal point and $\lambda^*$ be the dual optimal point. The primal and dual optimal points should satisfy $\|\mathbf{x}^*\|_1 \le z$ and $\lambda^* \ge 0$. Moreover, the $\ell_1$ ball constraint in (5) satisfies Slater's condition (Boyd & Vandenberghe, 2004, Section 5.2.3), since $\|\mathbf{0}\|_1 = 0 < z$. Therefore, strong duality holds, the primal and dual optimal values are equal, and we have the complementary slackness condition

    $\lambda^*(\|\mathbf{x}^*\|_1 - z) = 0.$    (6)

We show how to compute the primal optimal point $\mathbf{x}^*$ when the dual optimal point $\lambda^*$ is known. $\mathbf{x}^*$ is the optimal solution to the following problem:

    $\mathbf{x}^* = \arg\min_{\mathbf{x}} L(\mathbf{x}, \lambda^*).$    (7)

The problem in (7) has a unique solution, since $L(\cdot, \cdot)$ is strictly convex in its first argument.
Since the variables in (7) are decoupled, we have

    $x_i^* = \arg\min_{x_i} \frac{1}{2}(x_i - v_i)^2 + \lambda^* |x_i|,$

which leads to

    $x_i^* = \mathrm{sgn}(v_i) \max(|v_i| - \lambda^*, 0),$    (8)

where $\mathrm{sgn}(t)$ is the signum function: $\mathrm{sgn}(t) = 1$ if $t > 0$; $\mathrm{sgn}(t) = -1$ if $t < 0$; and $\mathrm{sgn}(t) = 0$ if $t = 0$.

Our methodology for solving problem (5) is to first compute the dual optimal point $\lambda^*$, from which the primal optimal point $\mathbf{x}^*$ follows by (8). We consider two cases: $\|\mathbf{v}\|_1 \le z$ and $\|\mathbf{v}\|_1 > z$. We show in the following lemma that in the first case $\lambda^* = 0$:

Lemma 1  If $\|\mathbf{v}\|_1 \le z$, then the dual optimal point $\lambda^*$ is zero, and the primal optimal point is given by $\mathbf{x}^* = \mathbf{v}$.

Proof: We first prove $\lambda^* = 0$ by contradiction. Assume that $\lambda^* > 0$. It follows from (8) that $\|\mathbf{x}^*\|_1 < z$; thus $\lambda^*(\|\mathbf{x}^*\|_1 - z) \ne 0$, which contradicts the complementary slackness condition in (6). Therefore $\lambda^* = 0$, and it follows from (8) that $\mathbf{x}^* = \mathbf{v}$.

Next, we focus on the case $\|\mathbf{v}\|_1 > z$. We show that $\lambda^*$ can be obtained by computing the root of an auxiliary function, as summarized in the following theorem:

Theorem 1  If $\|\mathbf{v}\|_1 > z$, then the dual optimal point $\lambda^*$ is positive, and is given by the unique root of

    $f(\lambda) = \sum_{i=1}^{n} \max(|v_i| - \lambda, 0) - z.$    (9)

Proof: We first prove that the auxiliary function $f(\cdot)$ has a unique root, and then prove that $\lambda^* > 0$ is that root. Denote the maximal absolute element of $\mathbf{v}$ by $v_{\max}$, that is, $v_{\max} = \max_i |v_i|$. It is clear that, for any $i$, $\max(|v_i| - \lambda, 0)$ is continuous and monotonically decreasing in $(-\infty, +\infty)$ with respect to $\lambda$, and strictly decreasing in $(-\infty, |v_i|]$. Thus, $f(\cdot)$ is continuous and monotonically decreasing in $(-\infty, +\infty)$, and strictly decreasing in $(-\infty, v_{\max}]$. From (9), we have $f(0) > 0$ (since $\|\mathbf{v}\|_1 > z$), $f(v_{\max} - z) \ge 0$, and $f(v_{\max}) = -z < 0$. According to the Intermediate Value Theorem, $f(\cdot)$ has a unique root, lying in the interval $[\max(0, v_{\max} - z), v_{\max})$.

Next, consider the dual optimal point $\lambda^*$. First, $\lambda^*$ must be positive: otherwise, if $\lambda^* = 0$, we would have $\mathbf{x}^* = \mathbf{v}$ from (8), and $\|\mathbf{x}^*\|_1 = \|\mathbf{v}\|_1 > z$, which contradicts $\|\mathbf{x}^*\|_1 \le z$. It then follows from the complementary slackness condition in (6) that $\|\mathbf{x}^*\|_1 = z$. Following (8), we have $f(\lambda^*) = 0$.
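Theorem 1, combined with (8), already yields a simple linear-time projection: bracket the root of $f(\cdot)$ in $[\max(0, v_{\max} - z), v_{\max})$, locate it by bisection (formally analyzed in Section 3.1), and soft-threshold. The following C sketch is our own minimal illustration of this scheme, not the authors' implementation; the names `proj_l1` and `f_aux` are ours, and the code omits the refinements of Section 3.

```c
/*
 * Minimal sketch (ours) of the l1-ball projection via Theorem 1:
 * bracket the root of f in (9), bisect, then soft-threshold by (8).
 */
#include <math.h>

/* f(lambda) = sum_i max(|v_i| - lambda, 0) - z, Eq. (9); O(n) per call. */
static double f_aux(const double *v, int n, double z, double lambda)
{
    double s = -z;
    for (int i = 0; i < n; i++) {
        double d = fabs(v[i]) - lambda;
        if (d > 0.0) s += d;
    }
    return s;
}

/* Projection of v onto {x : ||x||_1 <= z}; the result is written to x. */
void proj_l1(double *x, const double *v, int n, double z, double eps)
{
    double norm1 = 0.0, vmax = 0.0;
    for (int i = 0; i < n; i++) {
        double a = fabs(v[i]);
        norm1 += a;
        if (a > vmax) vmax = a;
    }
    if (norm1 <= z) {                 /* Lemma 1: lambda* = 0, x* = v */
        for (int i = 0; i < n; i++) x[i] = v[i];
        return;
    }
    /* Theorem 1: the root lies in [max(0, vmax - z), vmax). */
    double lo = (vmax - z > 0.0) ? vmax - z : 0.0, hi = vmax;
    while (hi - lo > eps) {           /* f is decreasing on [lo, hi] */
        double mid = 0.5 * (lo + hi);
        if (f_aux(v, n, z, mid) > 0.0) lo = mid; else hi = mid;
    }
    double lambda = 0.5 * (lo + hi);
    for (int i = 0; i < n; i++) {     /* soft-thresholding, Eq. (8) */
        double d = fabs(v[i]) - lambda;
        x[i] = (d > 0.0) ? (v[i] > 0.0 ? d : -d) : 0.0;
    }
}
```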
2.2. Projection onto the Specialized Polyhedron

The Euclidean projection onto the specialized polyhedron $G_2$ can be formally defined as

    $\pi_{G_2}(\mathbf{v}) = \arg\min_{\mathbf{x} = (\bar{\mathbf{x}}, \tilde{\mathbf{x}})} \frac{1}{2}\|\bar{\mathbf{x}} - \bar{\mathbf{v}}\|^2 + \frac{1}{2}\|\tilde{\mathbf{x}} - \tilde{\mathbf{v}}\|^2, \quad \mathrm{s.t.}\ \bar{\mathbf{x}} \ge 0,\ \tilde{\mathbf{x}} \ge 0,\ \bar{\mathbf{x}}^T\bar{\mathbf{e}} = \tilde{\mathbf{x}}^T\tilde{\mathbf{e}} \le C,$    (10)

where $\mathbf{v} = (\bar{\mathbf{v}}, \tilde{\mathbf{v}})$, $\bar{\mathbf{x}}, \bar{\mathbf{v}}, \bar{\mathbf{e}} \in \mathbb{R}^p$, $\tilde{\mathbf{x}}, \tilde{\mathbf{v}}, \tilde{\mathbf{e}} \in \mathbb{R}^{n-p}$, and the elements of $\bar{\mathbf{e}}$ and $\tilde{\mathbf{e}}$ are all 1's. Introducing the Lagrangian variables $\lambda$, $\bar{\boldsymbol{\mu}}$, $\tilde{\boldsymbol{\mu}}$, and $\eta$ for the constraints $\bar{\mathbf{x}}^T\bar{\mathbf{e}} = \tilde{\mathbf{x}}^T\tilde{\mathbf{e}}$, $\bar{\mathbf{x}} \ge 0$, $\tilde{\mathbf{x}} \ge 0$, and $\bar{\mathbf{x}}^T\bar{\mathbf{e}} \le C$, respectively, we can write the Lagrangian of (10) as

    $L(\bar{\mathbf{x}}, \tilde{\mathbf{x}}, \lambda, \eta, \bar{\boldsymbol{\mu}}, \tilde{\boldsymbol{\mu}}) = \frac{1}{2}\|\bar{\mathbf{x}} - \bar{\mathbf{v}}\|^2 + \frac{1}{2}\|\tilde{\mathbf{x}} - \tilde{\mathbf{v}}\|^2 + \lambda(\bar{\mathbf{x}}^T\bar{\mathbf{e}} - \tilde{\mathbf{x}}^T\tilde{\mathbf{e}}) + \eta(\bar{\mathbf{x}}^T\bar{\mathbf{e}} - C) - \bar{\boldsymbol{\mu}}^T\bar{\mathbf{x}} - \tilde{\boldsymbol{\mu}}^T\tilde{\mathbf{x}}.$

Let $\mathbf{x}^* = (\bar{\mathbf{x}}^*, \tilde{\mathbf{x}}^*)$ be the primal optimal point, and $\lambda^*$, $\eta^*$, $\bar{\boldsymbol{\mu}}^*$, $\tilde{\boldsymbol{\mu}}^*$ be the dual optimal points. It is easy to verify that the objective function in (10) is convex and differentiable and the constraint functions are differentiable; thus the duality gap is zero, and any points that satisfy the KKT conditions are primal and dual optimal (Boyd & Vandenberghe, 2004, Chapter 5.5.3). The KKT conditions for (10) are given by

    $\bar{x}_i^* \ge 0, \quad \bar{x}_i^* = \bar{v}_i - \lambda^* - \eta^* + \bar{\mu}_i^*,$    (11)
    $\tilde{x}_j^* \ge 0, \quad \tilde{x}_j^* = \tilde{v}_j + \lambda^* + \tilde{\mu}_j^*,$    (12)
    $\bar{\mu}_i^* \ge 0,\ \bar{\mu}_i^* \bar{x}_i^* = 0, \quad \tilde{\mu}_j^* \ge 0,\ \tilde{\mu}_j^* \tilde{x}_j^* = 0,$    (13)
    $\eta^* \ge 0, \quad \eta^* \Big(\sum_{i=1}^{p} \bar{x}_i^* - C\Big) = 0, \quad \sum_{i=1}^{p} \bar{x}_i^* \le C,$    (14)
    $\sum_{i=1}^{p} \bar{x}_i^* = \sum_{j=1}^{n-p} \tilde{x}_j^*,$    (15)

for all $i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, n-p$. We show in the following lemma that the KKT conditions (11-15) can be simplified:

Lemma 2  The conditions in (11-13) are equivalent to

    $\bar{x}_i^* = \max(\bar{v}_i - \lambda^* - \eta^*, 0),$    (16)
    $\tilde{x}_j^* = \max(\tilde{v}_j + \lambda^*, 0).$    (17)

Proof: First, we show that if (11-13) hold, then (16-17) hold. If $\bar{v}_i - \lambda^* - \eta^* > 0$, we have $\bar{x}_i^* > 0$ from (11), $\bar{\mu}_i^* = 0$ from (13), and thus $\bar{x}_i^* = \bar{v}_i - \lambda^* - \eta^*$ by using (11). If $\bar{v}_i - \lambda^* - \eta^* \le 0$, we have $\bar{x}_i^* = 0$: otherwise, if $\bar{x}_i^* > 0$, we have $\bar{\mu}_i^* > 0$ from (11), and then $\bar{x}_i^* = 0$ from (13), leading to a contradiction. Therefore, (16) holds. Following similar arguments, we obtain (17). Next, assume that (16-17) hold. By constructing $\bar{\mu}_i^* = \bar{x}_i^* - (\bar{v}_i - \lambda^* - \eta^*)$ and $\tilde{\mu}_j^* = \tilde{x}_j^* - (\tilde{v}_j + \lambda^*)$, we can verify that (11-13) hold.

Based on Lemma 2, the KKT conditions (11-15) can be simplified to (14-17). In the following discussion, we focus on computing the primal and dual optimal points from the simplified KKT conditions. We define the following three auxiliary functions:

    $\bar{g}(\lambda) = \sum_{i=1}^{p} \max(\bar{v}_i - \lambda, 0) - C,$    (18)
    $\tilde{g}(\lambda) = \sum_{j=1}^{n-p} \max(\tilde{v}_j + \lambda, 0) - C,$    (19)
    $g(\lambda) = \bar{g}(\lambda) - \tilde{g}(\lambda).$    (20)

Using similar arguments as in the proof of Theorem 1, we obtain the following properties of these functions:

Lemma 3  Denote $\bar{v}_{\max} = \max_i \bar{v}_i$ and $\tilde{v}_{\max} = \max_j \tilde{v}_j$. Then:
i) $\bar{g}(\cdot)$ is continuous and monotonically decreasing in $(-\infty, +\infty)$, and strictly decreasing in $(-\infty, \bar{v}_{\max}]$;
ii) the root of $\bar{g}(\cdot)$ is unique and lies in $[\bar{v}_{\max} - C, \bar{v}_{\max})$;
iii) $\tilde{g}(\cdot)$ is continuous and monotonically increasing in $(-\infty, +\infty)$, and strictly increasing in $[-\tilde{v}_{\max}, +\infty)$;
iv) the root of $\tilde{g}(\cdot)$ is unique and lies in $(-\tilde{v}_{\max}, -\tilde{v}_{\max} + C]$;
v) $g(\cdot)$ is continuous and monotonically decreasing in $(-\infty, +\infty)$, and strictly decreasing in both $(-\infty, \bar{v}_{\max}]$ and $[-\tilde{v}_{\max}, +\infty)$;
vi) $g(\cdot)$ has at least one root; moreover, if $\bar{v}_{\max} > -\tilde{v}_{\max}$, then the root of $g(\cdot)$ is unique and lies in $(-\tilde{v}_{\max}, \bar{v}_{\max})$.

We summarize the main results of this section in the following theorem:

Theorem 2  Denote the unique roots of $\bar{g}(\cdot)$ and $\tilde{g}(\cdot)$ by $\bar{\lambda}$ and $\tilde{\lambda}$, respectively.
i) If $\tilde{\lambda} \le \bar{\lambda}$, then by setting $\lambda^* = \tilde{\lambda}$, $\eta^* = \bar{\lambda} - \tilde{\lambda}$, and the primal points according to (16) and (17), the simplified KKT conditions hold.
ii) If $\bar{\lambda} < \tilde{\lambda}$, then by setting $\eta^* = 0$, $\lambda^*$ as a root of $g(\cdot)$, and the primal points according to (16) and (17), the simplified KKT conditions hold. Moreover, if $\bar{v}_{\max} > -\tilde{v}_{\max}$, then $\lambda^*$ is the unique root of $g(\cdot)$; and if $\bar{v}_{\max} \le -\tilde{v}_{\max}$, then $\lambda^*$ can be any element in $[\bar{v}_{\max}, -\tilde{v}_{\max}]$, and meanwhile the primal optimal point $\mathbf{x}^*$ is a zero vector.

Proof: In both cases, (16) and (17) hold by construction, so we only need to verify the conditions in (14) and (15).

We first prove i). Since $\bar{\lambda}$ and $\tilde{\lambda}$ are the roots of $\bar{g}(\cdot)$ and $\tilde{g}(\cdot)$, respectively, we have $\bar{g}(\bar{\lambda}) = 0$ and $\tilde{g}(\tilde{\lambda}) = 0$. Since we set $\lambda^* = \tilde{\lambda}$ and $\eta^* = \bar{\lambda} - \tilde{\lambda}$, we have $\lambda^* + \eta^* = \bar{\lambda}$, and $\bar{g}(\lambda^* + \eta^*) = 0$. It follows from (16) and (18) that $\sum_{i=1}^{p} \bar{x}_i^* = C$. Similarly, we can verify that $\sum_{j=1}^{n-p} \tilde{x}_j^* = C$. Therefore, (15) holds. Since $\eta^* = \bar{\lambda} - \tilde{\lambda} \ge 0$ due to $\tilde{\lambda} \le \bar{\lambda}$, and $\sum_{i=1}^{p} \bar{x}_i^* = C$, we verify (14).

Next, we prove ii). We first show that $\lambda^* \in (\bar{\lambda}, \tilde{\lambda})$. According to the second property in Lemma 3, we have $\bar{\lambda} < \bar{v}_{\max}$; similarly, we have $\tilde{\lambda} > -\tilde{v}_{\max}$, according to the fourth property in Lemma 3. From the first property in Lemma 3, $\bar{g}(\cdot)$ is strictly decreasing in $(-\infty, \bar{v}_{\max}]$ and monotonically decreasing in $(-\infty, +\infty)$; thus

    $\bar{g}(\tilde{\lambda}) < \bar{g}(\bar{\lambda}) = 0.$    (21)

Similarly, from the third property in Lemma 3, we have

    $\tilde{g}(\bar{\lambda}) < \tilde{g}(\tilde{\lambda}) = 0.$    (22)

It follows from (20), (21), and (22) that

    $g(\bar{\lambda}) = \bar{g}(\bar{\lambda}) - \tilde{g}(\bar{\lambda}) > 0, \quad g(\tilde{\lambda}) = \bar{g}(\tilde{\lambda}) - \tilde{g}(\tilde{\lambda}) < 0.$

Since $g(\cdot)$ is continuous and monotonically decreasing (see the fifth property in Lemma 3), we have $\lambda^* \in (\bar{\lambda}, \tilde{\lambda})$. Following arguments similar to those for obtaining (21), we have $\bar{g}(\lambda^*) < 0$. Since we set $\eta^* = 0$, we have $\sum_{i=1}^{p} \bar{x}_i^* = \sum_{i=1}^{p} \max(\bar{v}_i - \lambda^*, 0) = \bar{g}(\lambda^*) + C < C$ by using (16) and (18). Therefore, (14) holds. It follows from (16-20), together with $g(\lambda^*) = 0$ and $\eta^* = 0$, that (15) holds.

From the sixth property in Lemma 3, the root of $g(\cdot)$ is unique if $\bar{v}_{\max} > -\tilde{v}_{\max}$. If $\bar{v}_{\max} \le -\tilde{v}_{\max}$, then following (18) and (19), we have $\bar{g}(\lambda) = \tilde{g}(\lambda) = -C$ and $g(\lambda) = 0$ for all $\lambda \in [\bar{v}_{\max}, -\tilde{v}_{\max}]$. Meanwhile, from (16) and (17), we have $\bar{x}_i^* = \tilde{x}_j^* = 0$ for all $i, j$, so the primal optimal point $\mathbf{x}^*$ is a zero vector.
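The case analysis of Theorem 2 only ever requires evaluating $\bar{g}$, $\tilde{g}$, and $g$, each of which costs one pass over the corresponding sub-vector. A minimal C sketch of these evaluations (our own code and naming, not the paper's implementation; `vbar` holds the $p$ entries of $\bar{\mathbf{v}}$ and `vtilde` the $n-p$ entries of $\tilde{\mathbf{v}}$):

```c
/* Sketch (ours) of the auxiliary functions (18)-(20). */
static double g_bar(const double *vbar, int p, double C, double lambda)
{
    double s = -C;                               /* Eq. (18), O(p) */
    for (int i = 0; i < p; i++)
        if (vbar[i] - lambda > 0.0) s += vbar[i] - lambda;
    return s;
}

static double g_tilde(const double *vtilde, int m, double C, double lambda)
{
    double s = -C;                               /* Eq. (19), O(n-p) */
    for (int j = 0; j < m; j++)
        if (vtilde[j] + lambda > 0.0) s += vtilde[j] + lambda;
    return s;
}

/* g(lambda) = g_bar(lambda) - g_tilde(lambda), Eq. (20); O(n). */
static double g_diff(const double *vbar, int p,
                     const double *vtilde, int m,
                     double C, double lambda)
{
    return g_bar(vbar, p, C, lambda) - g_tilde(vtilde, m, C, lambda);
}
```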
Following Theorem 2, we propose the following procedure for solving (10). First, we compute $\bar{\lambda}$ and $\tilde{\lambda}$, the unique roots of $\bar{g}(\cdot)$ and $\tilde{g}(\cdot)$. If $\tilde{\lambda} \le \bar{\lambda}$, we set $\lambda^* = \tilde{\lambda}$ and $\eta^* = \bar{\lambda} - \tilde{\lambda}$; if $\bar{\lambda} < \tilde{\lambda}$ and $\bar{v}_{\max} > -\tilde{v}_{\max}$, we set $\eta^* = 0$ and compute $\lambda^*$ as the unique root of $g(\cdot)$; and if $\bar{\lambda} < \tilde{\lambda}$ and $\bar{v}_{\max} \le -\tilde{v}_{\max}$, we set $\eta^* = 0$ and choose any element of $[\bar{v}_{\max}, -\tilde{v}_{\max}]$ as $\lambda^*$. Finally, with the computed dual optimal points, we obtain the primal optimal points from (16) and (17).

3. Efficient Euclidean Projections

We reformulated the Euclidean projections as root finding problems in the last section. In this section, we present efficient algorithms for computing the roots. Specifically, we present the bisection algorithm in Section 3.1 and an improved bisection algorithm in Section 3.2.

3.1. Euclidean Projections by Bisection

We first propose to make use of bisection for computing the dual optimal points. Bisection works by producing a sequence of intervals of uncertainty with strictly decreasing lengths. It is known that, for any continuous function with a unique root in the interval $[a, b]$, the number of bisection iterations is upper-bounded by $\lceil \log_2((b - a)/\varepsilon) \rceil$, where $b - a$ is the length of the initial interval of uncertainty and $\varepsilon$ denotes the pre-specified precision parameter.

When applying bisection to find the roots of $f(\cdot)$, $\bar{g}(\cdot)$, $\tilde{g}(\cdot)$, and $g(\cdot)$, it costs $O(n)$, $O(p)$, $O(n-p)$, and $O(n)$ floating-point operations (flops), respectively, to evaluate the function value once. From Lemma 3 and Theorems 1 and 2, the lengths of the initial intervals are upper-bounded by $z$, $C$, $C$, and $|\bar{v}_{\max} + \tilde{v}_{\max}|$, respectively, so the numbers of bisection iterations are upper-bounded by $\lceil \log_2(z/\varepsilon) \rceil$, $\lceil \log_2(C/\varepsilon) \rceil$, $\lceil \log_2(C/\varepsilon) \rceil$, and $\lceil \log_2(|\bar{v}_{\max} + \tilde{v}_{\max}|/\varepsilon) \rceil$, respectively. Once the dual optimal point(s) have been computed, we can recover the primal optimal point $\mathbf{x}^*$ from (8), or from (16) and (17), in $O(n)$ flops. Therefore, the time complexity for solving these two Euclidean projections by bisection is $O(n)$.
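For concreteness, the bisection step can be coded once and reused for all four auxiliary functions. The sketch below is ours, not the paper's implementation; it assumes a continuous, monotonically decreasing function with $h(\mathrm{lo}) \ge 0 \ge h(\mathrm{hi})$, so for the increasing function $\tilde{g}$ one passes its negation.

```c
/*
 * Generic bisection sketch (ours) for Section 3.1. The loop runs at
 * most ceil(log2((hi - lo) / eps)) times, and each call to h is one
 * pass over the data, giving the overall O(n) complexity.
 */
typedef double (*root_fn)(double lambda, void *ctx);

static double bisect(root_fn h, void *ctx, double lo, double hi, double eps)
{
    while (hi - lo > eps) {
        double mid = 0.5 * (lo + hi);
        if (h(mid, ctx) > 0.0) lo = mid;   /* root lies to the right */
        else                   hi = mid;   /* root lies to the left  */
    }
    return 0.5 * (lo + hi);
}
```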
3.2. Euclidean Projections by Improved Bisection

Although bisection can solve the Euclidean projections in linear time, it has the limitation that its efficiency is independent of the function: it cannot be improved even when the function is "well-behaved". The underlying reason is that bisection utilizes only the signs of the function at the two boundary points, not their values. To improve efficiency, one natural alternative is an interpolation method (Brent, 1971), which has a better local convergence rate than bisection. Well-known interpolation methods include Newton's method and the secant method, which have locally Q-quadratic and Q-superlinear convergence rates, respectively. However, both Newton's method and the secant method can diverge; to overcome this limitation, safeguarded methods (Brent, 1971) have been proposed. Interpolation methods such as Newton's method, the secant method, and their safeguarded versions are designed for general-purpose root finding. In this subsection, we develop an efficient improved bisection algorithm for finding the root by explicitly using the "structure" of the auxiliary functions. Due to the similarities among these auxiliary functions, we take $f(\cdot)$ as an example in the following discussion.

We note that the two key factors influencing the efficiency of a root finding algorithm are: (1) the cost of evaluating the function value, and (2) the number of iterations. In what follows, we detail how to reduce the cost of evaluating $f(\lambda)$ in Section 3.2.1, and how to reduce the number of iterations in Section 3.2.2.

For convenience of illustration, we denote $\mathbf{u} = |\mathbf{v}|$, that is, $u_i = |v_i|$, with which the auxiliary function $f(\cdot)$ in (9) can be written as $f(\lambda) = \sum_{i=1}^{n} \max(u_i - \lambda, 0) - z$. We first reveal the convexity property of $f(\lambda)$: since $\max(u_i - \lambda, 0)$ is convex in $\lambda$ for all $i$, the auxiliary function $f(\lambda)$ is convex, as summarized in the following lemma:

Lemma 4  The auxiliary function $f(\lambda)$ in (9) is convex.

3.2.1. Efficient Evaluation of f(λ)

In this subsection, we aim to reduce the computational cost of evaluating $f(\cdot)$. Denoting $R_\lambda = \{i \mid i \in [n],\ u_i > \lambda\}$, we can write $f(\lambda)$ as

    $f(\lambda) = \sum_{i=1}^{n} \max(u_i - \lambda, 0) - z = \sum_{i \in R_\lambda} (u_i - \lambda) - z = \sum_{i \in R_\lambda} u_i - \lambda |R_\lambda| - z,$    (23)

where $|R_\lambda|$ denotes the number of elements in $R_\lambda$. It is easy to verify that $f(\lambda)$ is a piecewise linear function, as illustrated in the left plot of Figure 2.[1] It is clear that $f(\lambda)$ is not differentiable at $u_i$, for $i = 1, 2, \ldots, n$. However, as revealed in Lemma 4, $f(\lambda)$ is convex, and we can define the subgradient (Nemirovski, 1994) of $f(\lambda)$ as

    $f'(\lambda) = -|R_\lambda|.$    (24)

Thus, $f'(\lambda)$ is monotonically increasing and non-positive, and $f'(\lambda) = 0$ if and only if $\lambda \ge u_{(1)} = v_{\max}$. The right plot of Figure 2 illustrates the subgradient $f'(\lambda)$, which is also a piecewise linear function. From (23) and (24), we can rewrite $f(\lambda)$ as

    $f(\lambda) = \lambda f'(\lambda) + b(\lambda),$    (25)

where $b(\lambda) = \sum_{i \in R_\lambda} u_i - z$ is the bias of the piecewise linear function at $\lambda$. Equation (25) implies that the efficient evaluation of $f(\lambda)$ reduces to the efficient calculation of $f'(\lambda)$ and $b(\lambda)$.

[Figure 2. Illustration of the auxiliary function $f(\lambda)$ (left) and its subgradient $f'(\lambda)$ (right).]

[1] For convenience of illustration, we denote by $u_{(i)}$ the $i$-th order statistic of $\mathbf{u}$, i.e., $u_{(1)} \ge u_{(2)} \ge \ldots \ge u_{(n)}$. However, the proposed algorithm does not need to sort the elements of $\mathbf{u}$.

Let the current interval of uncertainty be $[\lambda_1, \lambda_2]$, and suppose $f'(\lambda_2)$ and $b(\lambda_2)$ have been computed. We show how to evaluate $f(\lambda)$ for any $\lambda \in [\lambda_1, \lambda_2]$. Denoting $U_\lambda = \{i \mid \lambda < u_i \le \lambda_2\}$, we can compute $f'(\lambda)$ and $b(\lambda)$ as

    $f'(\lambda) = -|U_\lambda| + f'(\lambda_2), \quad b(\lambda) = \sum_{i \in U_\lambda} u_i + b(\lambda_2),$

which shows that, for any $\lambda \in [\lambda_1, \lambda_2]$, we only need the elements in the interval $(\lambda_1, \lambda_2]$ to compute the subgradient $f'(\lambda)$ and the bias $b(\lambda)$. Note that the number of elements in the interval $(\lambda_1, \lambda_2]$ decreases as the iterative procedure proceeds (for example, the length of the interval is halved in each iteration of bisection), thus reducing the computational cost of evaluating $f(\lambda)$.
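A sketch of this bookkeeping in C (our own data layout and naming, not the authors' code): we keep only the candidates $u_i$ inside the current interval, together with $f'(\lambda_2)$ and $b(\lambda_2)$, so each evaluation costs $O(|U|)$ rather than $O(n)$.

```c
/* Sketch (ours) of the incremental evaluation of Section 3.2.1. */
typedef struct {
    double *cand;   /* u_i values with lo < u_i <= hi          */
    int     m;      /* number of remaining candidates, |U|     */
    double  gp_hi;  /* subgradient f'(hi), Eq. (24)            */
    double  b_hi;   /* bias b(hi), Eq. (25)                    */
} fstate;

/* Evaluate f(lambda) for lambda in [lo, hi]; cost O(m), not O(n). */
static double f_eval(const fstate *st, double lambda,
                     double *gp_out, double *b_out)
{
    int cnt = 0;
    double sum = 0.0;
    for (int i = 0; i < st->m; i++)
        if (st->cand[i] > lambda) { cnt++; sum += st->cand[i]; }
    double gp = st->gp_hi - cnt;   /* f'(lambda) = -|U| + f'(hi)     */
    double b  = st->b_hi + sum;    /* b(lambda) = sum_U u_i + b(hi)  */
    if (gp_out) *gp_out = gp;
    if (b_out)  *b_out  = b;
    return lambda * gp + b;        /* Eq. (25): f = lambda f' + b    */
}

/* Shrink the interval to (lo, hi_new]: keep only candidates inside it,
 * and store f'(hi_new), b(hi_new) as returned by f_eval at hi_new. */
static void f_shrink(fstate *st, double lo, double hi_new,
                     double gp_hi_new, double b_hi_new)
{
    int k = 0;
    for (int i = 0; i < st->m; i++)
        if (st->cand[i] > lo && st->cand[i] <= hi_new)
            st->cand[k++] = st->cand[i];
    st->m = k;
    st->gp_hi = gp_hi_new;
    st->b_hi  = b_hi_new;
}
```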
3.2.2. Reducing the Number of Iterations

To reduce the number of iterations, we propose to employ several models, based on Newton's method and the secant method, to obtain approximate solutions that tighten the interval of uncertainty; we then apply bisection to the tightened interval to obtain a new interval of uncertainty. Our method is thus an improved bisection algorithm, which effectively integrates Newton's method and the secant method within bisection, and it decreases the interval of uncertainty by a factor strictly larger than 2 in each iteration.

Let the current interval of uncertainty be $[\lambda_1, \lambda_2]$ (with $f(\lambda_1) > 0$ and $f(\lambda_2) < 0$), and suppose the following values have been obtained: the subgradients $f'(\lambda_1)$ and $f'(\lambda_2)$, the function values $f(\lambda_1)$ and $f(\lambda_2)$, and the biases $b(\lambda_1)$ and $b(\lambda_2)$. We construct three models for approximating $f(\cdot)$ (see Figure 3).

[Figure 3. Illustration of the three constructed models and the relationship among their roots ($f(\lambda_1) > 0$, $f(\lambda_T) \ge 0$, $f(\lambda_S) \le 0$, $f(\lambda_2) < 0$). The dashed piecewise linear line denotes $f(\cdot)$, and the solid line is the constructed model.]

The first model corresponds to the line that passes through $(\lambda_1, f(\lambda_1))$ with slope $f'(\lambda_1)$:

    $T_1(\lambda) = f(\lambda_1) + f'(\lambda_1)(\lambda - \lambda_1).$    (26)

The second model corresponds to the line that passes through $(\lambda_2, f(\lambda_2))$ with slope $f'(\lambda_2)$:

    $T_2(\lambda) = f(\lambda_2) + f'(\lambda_2)(\lambda - \lambda_2).$    (27)

When $\lambda_1 \ne u_i$ for all $i$, $T_1(\cdot)$ is the tangent line of $f(\cdot)$ at $\lambda_1$; similarly, when $\lambda_2 \ne u_i$ for all $i$, $T_2(\cdot)$ is the tangent line of $f(\cdot)$ at $\lambda_2$. Based on the definition of $f'(\lambda)$ in (24) and $\lambda_1 < v_{\max}$, we have $f'(\lambda_1) < 0$. Therefore, the unique root of $T_1(\cdot)$ is

    $\lambda_{T_1} = \lambda_1 - f(\lambda_1)/f'(\lambda_1),$    (28)

which satisfies $\lambda_{T_1} > \lambda_1$, since $f(\lambda_1) > 0$ and $f'(\lambda_1) < 0$. Similarly, when $f'(\lambda_2)$ is nonzero, the unique root of $T_2(\cdot)$ can be computed as

    $\lambda_{T_2} = \lambda_2 - f(\lambda_2)/f'(\lambda_2).$    (29)

Since $f(\cdot)$ is convex and $f'(\cdot)$ is a subgradient of $f(\cdot)$, the lines $T_1(\cdot)$ and $T_2(\cdot)$ always underestimate $f(\cdot)$, i.e., $f(\lambda_{T_1}) \ge T_1(\lambda_{T_1}) = 0$ and $f(\lambda_{T_2}) \ge T_2(\lambda_{T_2}) = 0$. Denote

    $\lambda_T = \max(\lambda_{T_1}, \lambda_{T_2}).$    (30)

It follows that $f(\lambda_T) \ge 0$ and $\lambda_T > \lambda_1$. Thus, $\lambda_T$ forms a tighter lower bound of the interval of uncertainty than $\lambda_1$.

The third model is based on the line passing through the two points $(\lambda_1, f(\lambda_1))$ and $(\lambda_2, f(\lambda_2))$:

    $S(\lambda) = f(\lambda_2) + \frac{f(\lambda_2) - f(\lambda_1)}{\lambda_2 - \lambda_1}(\lambda - \lambda_2).$    (31)

Since $f(\lambda_1) \ne f(\lambda_2)$ (note that $f(\lambda_1) > 0$ and $f(\lambda_2) < 0$), the unique root of $S(\cdot)$ can be written as

    $\lambda_S = \lambda_2 - f(\lambda_2)\,\frac{\lambda_2 - \lambda_1}{f(\lambda_2) - f(\lambda_1)},$    (32)

where $\lambda_S < \lambda_2$, since $f(\lambda_2)\,\frac{\lambda_2 - \lambda_1}{f(\lambda_2) - f(\lambda_1)} > 0$. From the convexity of $f(\cdot)$, we have

    $f(\lambda_S) = f\!\left(\frac{f(\lambda_2)}{f(\lambda_2) - f(\lambda_1)}\,\lambda_1 + \frac{-f(\lambda_1)}{f(\lambda_2) - f(\lambda_1)}\,\lambda_2\right) \le \frac{f(\lambda_2)}{f(\lambda_2) - f(\lambda_1)}\,f(\lambda_1) + \frac{-f(\lambda_1)}{f(\lambda_2) - f(\lambda_1)}\,f(\lambda_2) = 0.$

Thus, $\lambda_S$ forms a tighter upper bound of the interval of uncertainty than $\lambda_2$. Since $f(\cdot)$ is monotonically decreasing, we obtain the following relationship among $\lambda_1$, $\lambda_2$, $\lambda_T$, and $\lambda_S$ (see Figure 3):

    $\lambda_1 < \lambda_T \le \lambda_S < \lambda_2,$    (33)

where $\lambda_T = \lambda_S$ if and only if $f(\lambda_T) = f(\lambda_S) = 0$. With the computed tighter interval of uncertainty $[\lambda_T, \lambda_S]$, we choose the point $\lambda$ used in bisection as the middle point of $\lambda_T$ and $\lambda_S$:

    $\lambda = \frac{1}{2}(\lambda_T + \lambda_S).$    (34)

The updated interval of uncertainty is $[\lambda, \lambda_S]$ if $f(\lambda) > 0$, and $[\lambda_T, \lambda]$ if $f(\lambda) < 0$. The length of the interval of uncertainty is decreased by a factor strictly larger than 2, since

    $\lambda - \lambda_T = \lambda_S - \lambda = \frac{1}{2}(\lambda_S - \lambda_T) < \frac{1}{2}(\lambda_2 - \lambda_1).$    (35)
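One iteration of the improved bisection can thus be sketched as follows (our own code, not the authors' implementation). The caller evaluates $f$ and its subgradient at the returned trial point, keeps $[\lambda, \lambda_S]$ or $[\lambda_T, \lambda]$ according to the sign rule above, and updates the cached endpoint quantities with the incremental evaluation of Section 3.2.1.

```c
/*
 * Sketch (ours) of one improved-bisection iteration, under the
 * invariant f(lo) > 0 > f(hi). fp_lo = f'(lo) < 0 and fp_hi = f'(hi)
 * are the subgradients of Eq. (24). Returns the trial point, Eq. (34).
 */
static double improved_step(double lo, double f_lo, double fp_lo,
                            double hi, double f_hi, double fp_hi,
                            double *t, double *s)
{
    /* Newton models T1, T2; by convexity their roots never overshoot
     * the true root, Eqs. (28)-(30). */
    double t1 = lo - f_lo / fp_lo;
    double t2 = (fp_hi != 0.0) ? hi - f_hi / fp_hi : t1;
    *t = (t1 > t2) ? t1 : t2;                      /* lambda_T */
    /* Secant model S; by convexity its root never undershoots,
     * Eq. (32). */
    *s = hi - f_hi * (hi - lo) / (f_hi - f_lo);    /* lambda_S */
    return 0.5 * (*t + *s);                        /* Eq. (34) */
}
```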
3.2.3. Discussion

The improved bisection algorithm enjoys the following two properties: (1) the computational cost of evaluating $f(\cdot)$ consistently decreases as the iterations proceed; and (2) it requires fewer iterations than bisection, benefiting from the good local convergence rates of Newton's method and the secant method.

The improved bisection algorithm can also exploit an initial guess of the root (denoted by $\lambda_0$), which helps reduce the number of iterations when $\lambda_0$ is close to the target. With the initial interval $[\lambda_1, \lambda_2]$, we can easily incorporate $\lambda_0$ into the improved bisection algorithm by setting $\lambda_1 = \lambda_0$ if $f(\lambda_0) > 0$, and $\lambda_2 = \lambda_0$ otherwise. We note that, when applying the Euclidean projections to solve problems such as (4), adjacent Euclidean projections usually have close dual optimal points. Therefore, we can use the root found in the previous projection as a "warm" start.

In deriving the improved bisection algorithm, we only make use of the piecewise linear and convex "structures" of the auxiliary function; thus the improved bisection algorithm is directly applicable to $\bar{g}(\cdot)$ and $\tilde{g}(\cdot)$, which enjoy these two "structures". By some careful deductions, the improved bisection algorithm can also be extended to find the root of $g(\cdot)$. The key observation is that $g(\cdot)$ is the difference of the two convex and piecewise linear functions $\bar{g}(\cdot)$ and $\tilde{g}(\cdot)$, so that we can evaluate $g(\cdot)$ efficiently in the same way as $f(\cdot)$. Moreover, following ideas similar to those in Section 3.2.2, we can construct models to obtain $\lambda_T$ and $\lambda_S$ that tighten the interval of uncertainty, as follows. Let $[\lambda_1, \lambda_2]$ be the current interval of uncertainty. We obtain $\lambda_T$ as the intersection of the tangent line of $\bar{g}(\cdot)$ at $\lambda_1$ and the secant model of $\tilde{g}(\cdot)$ passing through $(\lambda_1, \tilde{g}(\lambda_1))$ and $(\lambda_2, \tilde{g}(\lambda_2))$, and $\lambda_S$ as the intersection of the tangent line of $\tilde{g}(\cdot)$ at $\lambda_2$ and the secant model of $\bar{g}(\cdot)$ passing through $(\lambda_1, \bar{g}(\lambda_1))$ and $(\lambda_2, \bar{g}(\lambda_2))$.
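As a usage sketch of the warm start (ours; it reuses `f_aux` from the earlier $\ell_1$-projection sketch, and the name `proj_l1_warm` is ours): `lambda0` carries the dual point found by the previous, adjacent projection, is used to tighten the initial bracket, and is overwritten with the new root for the next call.

```c
/* Warm-started l1-ball projection, a sketch of Section 3.2.3 layered
 * on the plain bisection projection sketched earlier (ours). */
void proj_l1_warm(double *x, const double *v, int n, double z,
                  double eps, double *lambda0)
{
    double norm1 = 0.0, vmax = 0.0;
    for (int i = 0; i < n; i++) {
        double a = fabs(v[i]);
        norm1 += a;
        if (a > vmax) vmax = a;
    }
    if (norm1 <= z) {                       /* Lemma 1 */
        for (int i = 0; i < n; i++) x[i] = v[i];
        *lambda0 = 0.0;
        return;
    }
    double lo = (vmax - z > 0.0) ? vmax - z : 0.0, hi = vmax;
    /* Warm start: if the previous root lies inside the bracket, it
     * replaces one endpoint according to the sign of f there. */
    if (*lambda0 > lo && *lambda0 < hi) {
        if (f_aux(v, n, z, *lambda0) > 0.0) lo = *lambda0;
        else                                hi = *lambda0;
    }
    while (hi - lo > eps) {
        double mid = 0.5 * (lo + hi);
        if (f_aux(v, n, z, mid) > 0.0) lo = mid; else hi = mid;
    }
    *lambda0 = 0.5 * (lo + hi);
    for (int i = 0; i < n; i++) {           /* Eq. (8) */
        double d = fabs(v[i]) - *lambda0;
        x[i] = (d > 0.0) ? (v[i] > 0.0 ? d : -d) : 0.0;
    }
}
```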
4. Experiments

To study the performance of the proposed projection algorithms, we randomly generate the input vector $\mathbf{v}$ according to two distributions: (1) the normal distribution with mean 0 and standard deviation 1, and (2) the uniform distribution on the interval $[-1, 1]$. We implement the proposed projection algorithms in C, and carry out the experiments on an Intel(R) Core(TM)2 Duo 3.00GHz processor.

An Illustrative Example  We first present an example to illustrate the improved bisection algorithm. In this experiment, we compute the Euclidean projection onto the $\ell_1$ ball for a problem of size $n = 10^5$; we generate $\mathbf{v}$ from the normal distribution and set $z = 100$. The result is presented in Table 1. We can observe from this table that the proposed improved bisection algorithm converges quite fast, and that the computational cost (proportional to $|U|$) of evaluating $f(\cdot)$ decreases rapidly.

Table 1. Illustration of the improved bisection algorithm: each row corresponds to an iteration; $[\lambda_1, \lambda_2]$ denotes the current interval, and $|U|$ denotes its size; $\lambda_T$ is computed from the two models $T_1(\cdot)$ and $T_2(\cdot)$; $\lambda_S$ is computed from the model $S(\cdot)$; $\lambda$ is the middle point of $\lambda_T$ and $\lambda_S$; the found root is $\lambda = 2.90105$ in the last row.

    |U|      λ1        λT        λ         λS        λ2
    10^5     0         0.79907   2.49512   4.19116   4.19641
    1242     2.49512   2.72716   3.24086   3.75455   4.19116
    502      2.72716   2.85927   2.93296   3.00665   3.24086
    88       2.85927   2.89934   2.90139   2.90343   2.93296
    1        2.89934   2.90105   2.90105   2.90105   2.90139

Number of Iterations  We compare the improved bisection algorithm (ibis) with bisection (bis) and the modified randomized median finding algorithm (mrmf) (Duchi et al., 2008) in terms of the number of iterations for solving the projection onto the $\ell_1$ ball. For ibis, we try two different settings: (1) ibis1, which does not require an initial guess of the root, and (2) ibis2, which employs the "warm" start, i.e., the root found in the previous problem is used as a warm start (we solve 1000 different problems for each fixed size $n$). We set $z = 100$ and report the results in Figure 4, which shows the average number of iterations over 1000 runs. We can observe from these figures that: 1) the number of iterations required by bisection is around 40; 2) the number of iterations required by mrmf is significantly smaller than that of bisection, which validates the good practical behavior of the randomized median finding algorithm; 3) the number of iterations for ibis1 is within 7, which is less than that required by mrmf; and 4) by employing the "warm" start technique, the number of iterations for ibis2 is further reduced to about 2.

[Figure 4. Comparison of the average number of iterations over 1000 runs. $\mathbf{v}$ is generated from the uniform distribution in the left plot, and from the normal distribution in the right plot.]

Computational Efficiency  We report the total computational time (in seconds) for solving 1000 independent projections onto the $\ell_1$ ball by the different methods in Tables 2 and 3, from which we can observe that all methods scale (roughly) linearly with $n$, and that ibis1 is more efficient than bisection and mrmf. With the "warm" start technique, ibis2 is much more efficient than ibis1.

Table 2. Total computational time (in seconds) for solving 1000 independent projections onto the $\ell_1$ ball: normal distribution with $z = 10$ (top half) and $z = 100$ (bottom half).

    n        10^3     10^4     10^5     10^6     10^7
    bis      0.0543   0.4323   4.691    78.35    788.9
    mrmf     0.0130   0.2720   2.776    37.68    380.3
    ibis1    0.0074   0.1276   1.509    19.62    196.2
    ibis2    0.0024   0.0877   1.126    17.06    167.9
    bis      0.1521   0.6178   4.926    78.52    790.3
    mrmf     0.0319   0.2901   2.766    37.84    383.2
    ibis1    0.0305   0.1843   1.541    19.61    196.5
    ibis2    0.0195   0.0946   1.133    17.06    167.8

Table 3. Total computational time (in seconds) for solving 1000 independent projections onto the $\ell_1$ ball: uniform distribution with $z = 10$ (top half) and $z = 100$ (bottom half).

    n        10^3     10^4     10^5     10^6     10^7
    bis      0.1247   0.7511   6.554    82.61    833.6
    mrmf     0.0332   0.2941   2.992    37.99    389.0
    ibis1    0.0286   0.1698   2.091    24.74    247.7
    ibis2    0.0210   0.0946   1.332    17.46    173.7
    bis      0.2030   1.1644   8.165    86.31    844.4
    mrmf     0.0332   0.3373   3.187    39.43    394.7
    ibis1    0.0266   0.1859   2.159    24.90    248.5
    ibis2    0.0198   0.1135   1.416    17.48    175.4

We also compare the improved bisection algorithm with the soft projections onto polyhedra (sopopo) approach proposed in (Shalev-Shwartz & Singer, 2006) for solving the projection onto the specialized polyhedron $G_2$. The results are presented in Table 4. We can observe from the table that ibis1 and ibis2 are more efficient than sopopo. The experimental results verify the efficiency of the proposed algorithms.

Table 4. Total computational time (in seconds) for solving 1000 independent projections onto the specialized polyhedron $G_2$ (we set $p = n/2$ and $C = 10$): normal distribution (top half) and uniform distribution (bottom half).

    n        10^3     10^4     10^5     10^6     10^7
    sopopo   0.0934   0.8401   9.574    142.5    1593
    ibis1    0.0274   0.1482   1.774    22.77    226.9
    ibis2    0.0200   0.0920   1.249    18.50    185.2
    sopopo   0.1147   0.9077   10.07    143.5    1605
    ibis1    0.0288   0.1725   2.084    24.73    258.2
    ibis2    0.0216   0.1002   1.364    18.57    191.4
5. Conclusion

In this paper, we study the problem of Euclidean projections onto the $\ell_1$ ball $G_1$ and the specialized polyhedron $G_2$. Our main results show that both Euclidean projections can be reformulated as root finding problems. Based on these reformulations, we can solve the Euclidean projections in (worst-case) linear time via bisection. We further explore the piecewise linear and convex "structures" of the auxiliary functions and propose an improved bisection algorithm. Empirical studies show that the proposed algorithms are much more efficient than the competing ones.

We are currently investigating $\ell_1$ ball constrained sparse learning problems via first-order methods, which include the proposed Euclidean projections as a key building block. We plan to extend the proposed algorithms to efficiently solve the projection $\pi_G(\mathbf{v} + \boldsymbol{\delta})$ when the result of $\pi_G(\mathbf{v})$ is known and has sparse structure, which can be useful in the scenario of online learning (Duchi et al., 2008; Shalev-Shwartz, 2007). We also plan to explore efficient entropic projections (Shalev-Shwartz, 2007), which use the entropy instead of the Euclidean norm in (1).

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, and NGA HM1582-08-1-0016.

References

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Brent, R. (1971). An algorithm with guaranteed convergence for finding a zero of a function. Computer Journal, 14, 422–425.

Candès, E., & Wakin, M. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25, 21–30.

Cormen, T., Leiserson, C., Rivest, R., & Stein, C. (2001). Introduction to algorithms. MIT Press.

Donoho, D. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52, 1289–1306.

Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the ℓ1-ball for learning in high dimensions. International Conference on Machine Learning (pp. 272–279).

Koh, K., Kim, S., & Boyd, S. (2007). An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8, 1519–1555.

Nemirovski, A. (1994). Efficient methods in convex programming. Lecture Notes.

Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers.

Ng, A. (2004). Feature selection, ℓ1 vs. ℓ2 regularization, and rotational invariance. International Conference on Machine Learning (pp. 78–85).

Shalev-Shwartz, S. (2007). Online learning: Theory, algorithms, and applications. Doctoral dissertation, Hebrew University.

Shalev-Shwartz, S., & Singer, Y. (2006). Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7, 1567–1599.

Shalev-Shwartz, S., & Srebro, N. (2008). Iterative loss minimization with ℓ1-norm constraint and guarantees on sparsity (Technical Report). TTI.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.