Learning When to Stop Thinking and Do Something!

Barnabás Póczos (poczos@cs.ualberta.ca), Yasin Abbasi-Yadkori (yasin.abbasi@gmail.com), Csaba Szepesvári (szepesva@cs.ualberta.ca), Russell Greiner (greiner@cs.ualberta.ca), Nathan Sturtevant (nathanst@cs.ualberta.ca)
Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

An anytime algorithm is capable of returning a response to the given task at essentially any time; typically the quality of the response improves as the time increases. Here, we consider the challenge of learning when we should terminate such algorithms on each of a sequence of iid tasks, in order to optimize the expected average reward per unit time. We provide a system for addressing this challenge, which combines the global optimizer Cross-Entropy method with local gradient ascent. This paper theoretically investigates how far the estimated gradient is from the true gradient, then empirically demonstrates that this system is effective by applying it to a toy problem, as well as to a real-world face detection task.

1. Introduction and Motivation

A mail sorter will scan each envelope as it passes by, trying to find and read the zip code. If it misreads the zip code, that mail item may be misdelivered, which incurs a penalty. It is also important for the sorter to be efficient, as there are many remaining mail items queued up: spending too long on one envelope will delay the others, and possibly prevent some from being delivered, which is also problematic (Russell & Subramanian, 1995).
Here, it is important to be both accurate and efficient. This is true of many other situations -- finding and classifying parts in a factory assembly line, finding and identifying car license plates as cars drive on a freeway, or detecting possible terrorists in airports. In all of these situations, we assume an anytime system has already been designed for the task (e.g., sorting the mail), which basically runs a series of subroutines A1(·), A2(·), ... on each instance (e.g., each envelope); we assume that later subroutines generally return successively higher quality responses. This paper addresses the challenge of learning a "stopping policy": that is, after the k-th subroutine Ak(Xt) has terminated on instance Xt, after spending time τ_k, the stopping policy has to decide whether to accept that algorithm's decision about Xt (that is, decide to route Xt according to Ak(Xt)) and then proceed to process instance Xt+1, or instead to run the next subroutine Ak+1 on this instance Xt (which means it will return the value of Ak+1(Xt), or perhaps Ak+2(Xt), ...). To help make this decision, Ak(Xt) returns some information that can help the stopping policy. Such information might indicate whether the envelope was handwritten or typed, describe the layout of the address field, or be any other information that became available while the subroutine processed the envelope.

For each instance Xt, once the system has accepted the decision of a subroutine, say at the Lt-th stage of the thinking process, the overall system (and hence the stopping policy) receives a (real-valued) reward, r_t = r(Xt, A_{Lt}), based on whether the letter Xt was routed correctly. This is after the system has spent some time "thinking" about this instance, which is the sum of the times required by the subroutines A1, ..., A_{Lt} on this instance, \sum_{k=1}^{L_t} \tau_k. The time spent on the decision to stop at the k-th stage is included in τ_k. Also, the reward function r = r(x, a) is not available to the decision maker (otherwise, the optimal action would be easy to compute, at least when the action space is not large). Our ultimate goal is to maximize the average reward, which is the accumulated reward on the envelope sequence divided by the time we spent "thinking" about the envelopes. Therefore, we have to consider both the reward we can accumulate, as well as the thinking time required.

Obviously, our problem can be viewed as finding an optimal policy in an average-reward partially observable Markovian Decision Problem (MDP) (Sutton & Barto, 1998), where the actions have different (possibly random) durations: the hidden state (in one formalization of the problem) is the optimal action argmax_a r(Xt, a) (i.e., the zip code); the observations are the information gathered while processing the instance, including the decisions of the subroutines; the actions are 'stop thinking' or 'continue'; and the reward is based on whether the letter was correctly routed. (A formal description appears in Section 2.) Given that the full scanned image of an envelope is available in the computer, one may as well model the problem as a fully observable MDP, where the state is Xt. This is promising, since powerful value-function based approaches tend to work poorly in partially observable environments when state aliasing is a problem. However, by the very nature of our problem, it seems that practical value-function based methods would themselves need to introduce significant state aliasing: since we want to keep the processing times small, we cannot allow arbitrarily complex procedures when computing the values of actions for some input Xt. Indeed, all standard value function computations would extract some finite (small) representation of Xt and obtain the values based on this representation. Given the complicated nature of Xt, significant state aliasing then seems unavoidable.

Therefore, in this paper we consider policy search approaches. More precisely, we investigate both generic policy search based on the Cross-Entropy (CE) method (Rubinstein & Kroese, 2004) and policy gradient methods (Williams, 1992; Baxter & Bartlett, 2001), and their combination, given some fixed parametric class of policies. The main problem with policy gradient methods is that the variance of the gradient estimate is hard to control. Due to the aforementioned difficulties related to the use of value functions, we do not attempt to employ value-function based approaches to reduce the variance, which is usually essential to achieve good performance (Greensmith et al., 2004). Thanks to the special structure of our problem, we will show how to obtain reasonable estimates of the gradient. This observation is one of the main contributions of the paper. We prove a bound on the quality of the estimate of the gradient as a function of the number of instances processed, and propose a stopping rule for deciding whether the gradient estimate is sufficiently good. We also demonstrate that these approaches are effective using empirical studies on a simple artificial example, as well as on a real-world face detection application. After Section 2 formally defines our problem, Section 3 provides two efficient learning algorithms and summarizes our empirical studies evaluating these algorithms.
We close this section with a brief overview of related work.

Related work: As mentioned above, as an anytime algorithm (Zilberstein, 1996) is working on a problem, it can be interrupted; if so, it will then return its current best answer. Note that this model assumes the external world will terminate the process (e.g., when the current envelope is gone and a new one appears in its place); by contrast, our task is to design a policy that itself decides when to terminate. Russell and Subramanian (1995) consider "bounded rational agents" in this anytime framework. Their work focuses on the challenge of building this sequence of algorithms, given the relevant information about the world. Our paper, by contrast, describes a way to learn how to make decisions based on the available information. Turney (2000) summarizes many different types of "cost" in the context of learning. Here, we are considering only the cost of the learned system, in terms of the time required to produce an accurate response; we are not explicitly considering the cost of the learning process.

Many others have considered this challenge of learning a system that must be both efficient and accurate. Greiner et al. (2002) define an "active classifier" as a process that can sequentially acquire information about an instance before rendering a decision (e.g., performing some sequence of tests on a patient before declaring that patient to be healthy or sick); an optimal such active classifier minimizes the expected cost of collecting this information plus the expected penalty for misclassification. They observe that an active classifier corresponds to an "action policy" (given the results of the previous tests, take some specific action; here this corresponds to a decision tree), and then consider (PAC-)learning the optimal such policy in various settings. By contrast, our system is only thinking about an instance, and is not making decisions about what extra information to collect (information collection happens independently); moreover, we are not learning the optimal action policy, but instead considering the optimal "stopping policy" with respect to a fixed action policy. (We note that our approach naturally extends to this case, and also to the case where information collection has a price.) Moreover, their model requires knowing a model of the actions in advance; our approach does not. Finally, while their approach is basically a "batch learner", our system can continuously improve based on observations.

Optimal stopping has been extensively investigated in the statistics and economics literature (Shiryaev, 2007). Using their terminology, we are exploring a Bayesian optimal stopping problem, where the prior distribution is over the instances (the distribution of envelopes). However, the focus in this literature is on the existence of optimal stopping rules (often by reducing the problem to an MDP), or on obtaining closed-form solutions for specific problems. By contrast, here we are trying to use a set of instances to learn a good (not necessarily optimal) stopping policy. Finally, we note that Viola and Jones (2001) were also interested in producing an efficient face detection system by applying a sequence of successively more expensive classifiers. That system used only a single simple type of base classifier (linear separators), and had a one-sided stopping criterion (stop as soon as any base classifier says "no"); by contrast, we consider general classifiers, and allow termination on either "yes" or "no" decisions.
2. Formal Definition of the Problem

Let X1, X2, ... be an independent and identically distributed (iid) sequence -- e.g., corresponding to the series of envelopes mentioned in the previous section. On the t-th step, when first dealing with the instance Xt, we start a "thinking process" that can involve at most K stages. In each stage k (1 ≤ k ≤ K), we compute information about Xt, denoted Y_{t,k} ∈ Y_k, which is the result of the thinking at stage k with respect to Xt. We let τ_k denote the time used in thinking at stage k. We make the Markov assumption that the new knowledge Y_{t,k} is independent of Y_{t,1}, ..., Y_{t,k-2} given Y_{t,k-1} and Xt. This is not a restrictive assumption, since the domains of Y_{t,k} and Y_{t,j} can be different for k ≠ j. In particular, Y_{t,k} may simply include Y_{t,k-1}, ..., Y_{t,1}.

Let q_k be the stochastic policy that determines whether we terminate the thinking process at stage k. More precisely, at stage k, after seeing Y_{t,k}, q continues thinking about Xt with probability q_k(0|Y_{t,k}), or quits with probability q_k(1|Y_{t,k}) = 1 - q_k(0|Y_{t,k}). By convention, q terminates by the K-th stage if we have not terminated earlier; i.e., q_K(1|Y_{t,K}) = 1 for all Y_{t,K} ∈ Y_K. Let L_t ∈ {1, ..., K} denote the stage at which we quit on instance Xt, and let T_t = \sum_{k=1}^{L_t} \tau_k be the total thinking time required to deal with Xt. Let µ_k : Y_k → A be a fixed mapping from the result of the k-th information stage to the action space, determining what action we would take if we quit at stage k; hence A_t = µ_{L_t}(Y_{t,L_t}) is the action taken on instance Xt. (This corresponds to sending the envelope to the destination determined by A_t.) Figure 1 summarizes the whole process.

[Figure 1. Summary of the overall process. At each stage k, after thinking time τ_{t,k}, the policy observes Y_{t,k}; with probability q_k(0|Y_{t,k}) it continues to stage k+1, otherwise it quits, taking action µ_k(Y_{t,k}) and receiving reward r(X_t, µ_k(Y_{t,k})); at stage K it must quit, since q_K(1|Y_{t,K}) = 1.]

The performance criterion for policy q = (q_1, ..., q_K) is defined as the expected average reward per time step:

    \rho_q = E\left[ \liminf_{t \to \infty} \frac{\sum_{s=1}^{t} r(X_s, A_s)}{\sum_{s=1}^{t} T_s} \right],    (1)

where r : X × A → R is the reward function, which is assumed to be uniformly bounded. Here, ρ_q's dependence on q is through A_s and T_s, whose distributions are each determined by q. We also assume that T_s is bounded away from zero with probability one and is bounded from above with probability one -- i.e., there is a minimal and a maximal thinking time. Our challenge is to determine the q that maximizes ρ_q. We consider a learning scenario where all quantities involved (including the distribution of X_t, the transition kernels p_k, and the reward r) are unknown, but where we can sample trajectories and observe the rewards.

3. Learning Stopping Policies

In this section we derive two policy-gradient-based algorithms. Since {X_t} is iid, so are {R_s = r(X_s, A_s)} and {T_s}. As

    \frac{\sum_{s=1}^{t} r(X_s, A_s)}{\sum_{s=1}^{t} T_s} = \frac{\frac{1}{t}\sum_{s=1}^{t} r(X_s, A_s)}{\frac{1}{t}\sum_{s=1}^{t} T_s},

by the law of large numbers the numerator and the denominator converge (respectively) to E[r(X_1, A_1)] and E[T_1]. This allows us to rewrite the performance criterion as

    \rho_q = \frac{E[r(X_1, A_1)]}{E[T_1]},    (2)

which is much easier to optimize than the original average cost criterion (Equation 1). In particular, this is the "trick" that allows us to obtain gradient estimates relatively quickly without resorting to value functions or other variance reduction techniques. In the next sections we propose two estimates of the gradient of our criterion based on (2), assuming that the parameters of q are collected in the vector θ -- i.e., q = q_θ. That is, our goal now is to calculate the gradient of the performance criterion ρ_{q_θ} with respect to θ.
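To make the ratio form (2) concrete, the following minimal Python sketch simulates a parametric stopping policy on a batch of instances and evaluates it by the ratio of sample means. It is illustrative only: the callables sample_instance, think_one_stage, stop_prob and reward are our own hypothetical stand-ins for the instance distribution, the subroutines A_k, the policy q_θ and the (unknown to the learner) reward r; the gradient computation itself follows in Section 3.1.

    import numpy as np

    def evaluate_policy(theta, sample_instance, think_one_stage, stop_prob, reward,
                        K, rng, n=100):
        """Estimate rho_q = E[r(X_1, A_1)] / E[T_1] by simulating n instances."""
        total_reward, total_time = 0.0, 0.0
        for _ in range(n):
            x = sample_instance(rng)                   # draw an instance X_t (e.g., an envelope)
            y, t_spent = None, 0.0
            for k in range(1, K + 1):
                y, tau_k = think_one_stage(x, y, k)    # run subroutine A_k: get Y_{t,k} and tau_k
                t_spent += tau_k
                # quit with probability q_k(1|Y_{t,k}) = 1 - q_k(0|Y_{t,k}); stage K always quits
                if k == K or rng.random() < 1.0 - stop_prob(theta, y, k):
                    total_reward += reward(x, y, k)    # r(X_t, mu_k(Y_{t,k}))
                    break
            total_time += t_spent
        return total_reward / total_time               # ratio of sample means, cf. Eq. (2)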
Clearly, using the notation E[r_1] = E[r(X_1, A_1)] for the expected reward and E[T_1] for the expected thinking time,

    \nabla_\theta \rho_q = \nabla_\theta \frac{E[r_1]}{E[T_1]} = \frac{E[T_1]\,\nabla_\theta E[r_1] - E[r_1]\,\nabla_\theta E[T_1]}{E[T_1]^2}.

Our gradient estimates will be based on this formula: we estimate each quantity in this formula individually and then combine the estimates.

3.1. Direct Gradient Ascent

We can estimate the expectations of both T_1 and r(X_1, A_1) by using the sample means based on n instances:

    E[T_1] \approx \frac{1}{n}\sum_{t=1}^{n}\sum_{k=1}^{L_t} \tau_k, \qquad E[r(X_1, A_1)] \approx \frac{1}{n}\sum_{t=1}^{n} R_t.

Now, to develop estimates of the gradients of E[r(X_1, A_1)] and E[T_1], observe that E[r(X_1, A_1)] is just the reward obtained in an episodic problem (the reward is received at the end of the episode). Hence, one can use the method of likelihood ratios, also known as the REINFORCE algorithm in the reinforcement learning literature, to calculate the derivative (Williams, 1992). To deal with \nabla_\theta E[T_1], recall that stage k of the thinking process costs τ_k; E[T_1] is just the expected value of the total cost in this finite-horizon, episodic problem, and hence we can use REINFORCE again to estimate its derivative.

In detail, let (Q_{t,k})_{k=1,...,L_t} be the decisions made at the stages of instance t. For each t ∈ {1, ..., n}, we define the trajectory \tilde{Y}_t = [\hat{Y}_{t,0}, \hat{Y}_{t,1}, \hat{Q}_{t,1}, \hat{Y}_{t,2}, \hat{Q}_{t,2}, ..., \hat{Y}_{t,K}, \hat{Q}_{t,K}], where \hat{Y}_{t,0} = X_t and \hat{Q}_{t,k} = Q_{t,k} if k ≤ L_t; otherwise, by definition, \hat{Q}_{t,k} = 1. According to our notation, the final reward at the end of the t-th trajectory is simply R_t = r(X_t, A_t), which (with a slight abuse of notation) we will denote by r(\tilde{Y}_t). Note that r(\tilde{Y}_t) depends only on the first L_t stages of the thinking process. Letting f_\theta(\tilde{y}) denote the density function of trajectory \tilde{y}, and r(\tilde{y}) the corresponding reward, we have E[r(X_t, A_t)] = \int r(\tilde{y}) f_\theta(\tilde{y})\, d\tilde{y}. (When the space of trajectories is countable, f_\theta(\tilde{y}) should be understood as the probability of seeing \tilde{y} under q_\theta.) Under mild regularity conditions,

    \nabla_\theta E[r(X_t, A_t)] = \int r(\tilde{y})\, \nabla_\theta f_\theta(\tilde{y})\, d\tilde{y} = \int r(\tilde{y})\, (\nabla_\theta \ln f_\theta(\tilde{y}))\, f_\theta(\tilde{y})\, d\tilde{y} = E[r(\tilde{Y}_t)\, \nabla_\theta \ln f_\theta(\tilde{Y}_t)].

We can use the Markov property along the trajectory \tilde{y} = (y_0, y_1, \hat{q}_1, ..., y_{k-1}, \hat{q}_{k-1}, y_k) to derive

    \ln f_\theta(\tilde{y}) = \ln f_0(y_0) + \sum_{k=1}^{\ell(\tilde{y})} \ln p_k(y_k | y_{k-1}) + \sum_{k=1}^{\ell(\tilde{y})-1} \ln q_\theta(0 | y_k) + \ln q_\theta(1 | y_{\ell(\tilde{y})}),

where f_0 is the density underlying \hat{Y}_{t,0} = X_t, p_k is the Markov kernel underlying the k-th transition, and \ell(\tilde{y}) = \min\{k > 0 : \hat{q}_k = 1\} is the index of the stage in \tilde{y} at which q_\theta quits. Thus, \nabla_\theta \ln f_\theta(\tilde{Y}_t) = \sum_{k=1}^{L_t - 1} \nabla_\theta \ln q_\theta(0 | Y_{t,k}) + \nabla_\theta \ln q_\theta(1 | Y_{t,L_t}), leading to the unbiased estimate of \nabla_\theta E[r(X_1, A_1)]:

    \widehat{\nabla r} = \frac{1}{n} \sum_{t=1}^{n} r(\tilde{Y}_t)\, s(\tilde{Y}_t), \qquad \text{where } s(\tilde{Y}_t) = \sum_{k=1}^{L_t - 1} \nabla_\theta \ln q_\theta(0 | Y_{t,k}) + \nabla_\theta \ln q_\theta(1 | Y_{t,L_t}).

Similarly, \widehat{\nabla T} = \frac{1}{n} \sum_{t=1}^{n} T_t\, s(\tilde{Y}_t) is an unbiased estimate of \nabla_\theta E[T_1]. Putting the pieces together, we see that the gradient of the performance criterion can be estimated with

    \hat{G}_n = \frac{1}{n} \sum_{t=1}^{n} \left( \frac{r(\tilde{Y}_t)}{\hat{T}} - \frac{\hat{r}\, T_t}{\hat{T}^2} \right) \left( \sum_{k=1}^{L_t - 1} \nabla_\theta \ln q_\theta(0 | Y_{t,k}) + \nabla_\theta \ln q_\theta(1 | Y_{t,L_t}) \right),

where \hat{T} = \frac{1}{n} \sum_{t=1}^{n} \sum_{k=0}^{L_t} \tau_k is the empirical average thinking time, and \hat{r} = \frac{1}{n} \sum_{t=1}^{n} r(\tilde{Y}_t) is the empirical average reward. We will refer to this algorithm as Direct Gradient Ascent (DGA). The algorithm proceeds as follows: given a parameter θ, it first simulates the policy q_θ on a batch of n samples, then it calculates the estimated gradient and updates the parameters using θ_new := θ + α \hat{G}_n, for some learning rate α > 0 (the learning rate may change over time). It then collects another batch of n samples and iterates.
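As a concrete illustration of the DGA estimator just derived, here is a minimal NumPy sketch of \hat{G}_n for a generic parametric stopping policy. It assumes the caller supplies, for each of the n simulated trajectories, the terminal reward r(\tilde{Y}_t), the total thinking time T_t, and the score s(\tilde{Y}_t) (the sum of grad-log-policy terms); the function names are our own, not the authors'.

    import numpy as np

    def dga_gradient(rewards, times, scores):
        """REINFORCE-style estimate of the gradient of E[r]/E[T] (the DGA estimator G_hat_n).

        rewards : array-like, shape (n,)    -- r(Y~_t) for each trajectory
        times   : array-like, shape (n,)    -- total thinking time T_t
        scores  : array-like, shape (n, d)  -- s(Y~_t), the summed grad-log policy terms
        """
        rewards = np.asarray(rewards, dtype=float)
        times = np.asarray(times, dtype=float)
        scores = np.asarray(scores, dtype=float)
        r_hat = rewards.mean()              # empirical average reward
        T_hat = times.mean()                # empirical average thinking time
        # per-trajectory weights  r(Y~_t)/T_hat - r_hat * T_t / T_hat**2
        weights = rewards / T_hat - r_hat * times / T_hat**2
        return (weights[:, None] * scores).mean(axis=0)

    def dga_step(theta, rewards, times, scores, alpha=0.1):
        """One gradient-ascent update  theta <- theta + alpha * G_hat_n."""
        return theta + alpha * dga_gradient(rewards, times, scores)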
3.2. The Quality of the Estimated Gradient

The previous section derived the DGA estimate of the gradient. This section estimates how far this estimated gradient is from the true gradient. Fix θ and let G_θ denote the true gradient. Observe that

    \hat{G}_n = \frac{\widehat{\nabla r}}{\hat{T}} - \frac{\hat{r}}{\hat{T}} \cdot \frac{\widehat{\nabla T}}{\hat{T}}, \qquad G_\theta = \frac{\nabla E[r_1]}{E[T_1]} - \frac{E[r_1]}{E[T_1]} \cdot \frac{\nabla E[T_1]}{E[T_1]}.

Let ‖·‖ be an arbitrary vector norm. We can use Hoeffding's inequality (1963) to bound the chance that our empirical estimates are far from their true means: for every δ ∈ [0, 1] and every n, there exist ε_T, ε_r, ε_{∇T}, ε_{∇r} such that P(|E[T_1] - \hat{T}| ≤ ε_T) ≥ 1 - δ, P(|E[r_1] - \hat{r}| ≤ ε_r) ≥ 1 - δ, P(‖∇E[T_1] - \widehat{\nabla T}‖ ≤ ε_{∇T}) ≥ 1 - δ and P(‖∇E[r_1] - \widehat{\nabla r}‖ ≤ ε_{∇r}) ≥ 1 - δ, where ε_T, ε_r, ε_{∇T}, ε_{∇r} = O(\sqrt{\log(1/\delta)/n}). Using these quantities and applying the triangle inequality, we can bound the deviation between the true gradient G_θ and its empirical estimate \hat{G}_n:

    \|G_\theta - \hat{G}_n\| \le \left\| \frac{\widehat{\nabla r}}{\hat{T}} - \frac{\nabla E[r_1]}{E[T_1]} \right\| + \left\| \frac{\hat{r}\,\widehat{\nabla T}}{\hat{T}^2} - \frac{E[r_1]\,\nabla E[T_1]}{E[T_1]^2} \right\|.

With some more tedious calculations (using the triangle inequality several times and the bound (1 - x)^{-1} ≤ 1 + 2x for x ≤ 1/2), we arrive at the following result:

Proposition 1. Assume that n ≥ 2\log(1/\delta)/\tau_0^2, where τ_0 is an almost sure lower bound on T_1. Then, with probability 1 - δ,

    \|G_\theta - \hat{G}_n\| \le c_1 \sqrt{\frac{\log(4/\delta)}{n}} + c_2 \frac{\log(4/\delta)}{n},

where c_1, c_2 are constants that depend only on the ranges of the rewards, the thinking times, and their gradients.

3.3. A Stopping Rule for Preventing Slow Convergence Near Optima

The previous section proved that, with probability 1 - δ, ‖G_θ - \hat{G}_n‖ ≤ c(δ, n), where c(·, ·) can be calculated exactly, knowing the ranges of the rewards and the thinking times, independently of θ. We can use this to calculate a bound on n that ensures that G_θ and \hat{G}_n are sufficiently close to each other. However, knowing this is not enough for the purposes of gradient ascent: if we fix some ε > 0, then ‖G_θ‖ may itself be O(ε), in which case knowing that ‖G_θ - \hat{G}_n‖ < ε does not guarantee that following \hat{G}_n moves the parameters in the direction of the gradient. The idea here is to let ε change depending on the length of G_θ. In particular, if ‖G_θ - \hat{G}_n‖ ≤ ½‖G_θ‖, then the angle between G_θ and \hat{G}_n must be smaller than 90° -- i.e., \hat{G}_n^T G_θ > 0. Since we do not know ‖G_θ‖, we must estimate it from the sample. With probability 1 - δ, both max(0, ‖\hat{G}_n‖ - c(δ, n)) ≤ ‖G_θ‖ and ‖G_θ - \hat{G}_n‖ ≤ c(δ, n) hold. Hence, when

    c(\delta, n) \le \tfrac{1}{2} \max\bigl(0,\; \|\hat{G}_n\| - c(\delta, n)\bigr),    (3)

we also have ‖G_θ - \hat{G}_n‖ ≤ ½‖G_θ‖. This proves the following result:

Theorem 2. Fix 0 < δ < 1 and let n = n(δ) be the first (random) time when Equation (3) holds. Then \hat{G}_n^T G_θ > 0 with probability 1 - δ.

Further, it is possible to bound the expected number of samples required to achieve this. Another possible improvement is to use an empirical Bernstein bound (taking the variances into account) instead of relying on Hoeffding's inequality, as in Mnih et al. (2008). Although these modifications might improve the efficiency, they are outside the scope of the present work and are not explored further here.
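The stopping rule of Theorem 2 is easy to operationalize once a concrete form of c(δ, n) is fixed. The sketch below assumes, purely for illustration, a bound of the Proposition 1 form c(δ, n) = c1·sqrt(log(4/δ)/n) + c2·log(4/δ)/n with user-supplied constants c1, c2 (in the paper these can be computed from the ranges of the rewards and thinking times); collect_batch and dga_gradient are placeholders for trajectory simulation and the estimator of Section 3.1.

    import numpy as np

    def c_bound(delta, n, c1, c2):
        """Deviation bound of the Proposition 1 form (c1, c2 are problem-dependent constants)."""
        L = np.log(4.0 / delta)
        return c1 * np.sqrt(L / n) + c2 * L / n

    def estimate_gradient_until_reliable(theta, collect_batch, dga_gradient,
                                         delta=0.05, c1=1.0, c2=1.0,
                                         batch_size=10, max_samples=10_000):
        """Grow the sample until Eq. (3) holds:
        c(delta, n) <= 0.5 * max(0, ||G_hat_n|| - c(delta, n)),
        which guarantees G_hat_n^T G > 0 with probability at least 1 - delta."""
        rewards, times, scores = [], [], []
        while len(rewards) < max_samples:
            r_b, t_b, s_b = collect_batch(theta, batch_size)   # simulate more trajectories
            rewards += list(r_b); times += list(t_b); scores += list(s_b)
            n = len(rewards)
            g_hat = dga_gradient(rewards, times, scores)
            c = c_bound(delta, n, c1, c2)
            if c <= 0.5 * max(0.0, np.linalg.norm(g_hat) - c):
                return g_hat, n                                # estimated gradient is reliable
        return dga_gradient(rewards, times, scores), len(rewards)  # give up at the sample cap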
3.4. A Toy Problem

Our experiments are based on the situation mentioned above, where we want to sort a sequence of envelopes based on their respective zip codes. For each envelope Xt, we apply a fixed sequence of (up to K = 4) subroutines A1, ..., AK, where each Ak requires a fixed amount of time τ_k and returns a response Y_{t,k}, i.e., an interpretation of the zip code digits and some estimate of how certain this decision is. At each stage k of processing each envelope Xt, our stopping policy q_θ decides whether to continue to the next stage (running A_{k+1} on this Xt), or to "quit", which means taking the action A_t = µ_k(Y_{t,k}), receiving reward r(Xt, A_t), and then going on to consider the next envelope X_{t+1}.

We simulate this problem as follows: we characterize the difficulty of an envelope by h ~ Beta(1, 1), where h = 1 means very easy and h = 0 means very difficult. At stage k, let p_k be the probability that Y_{t,k} is the correct zip code; we initialize p_0 = h, and the probability improves as p_{k+1} = min{p_k + 0.1, 1} (hence, at stage k, p_k = min{h + 0.1 k, 1}). The reward r(Xt, µ_k(Y_{t,k})) equals 1 with probability f(p_k), where f(·) is a function unknown to the learner; we use f(p) = p in our experiments. At each stage, Y_{t,k} by assumption contains p_k. The decision to quit or not therefore depends on p_k, as well as on the stage index k.

We discretize the 2D space [0, 1] × {1, 2, 3, 4} into M grid cells and introduce the corresponding feature vector φ(·, ·) ∈ {0, 1}^M, where φ_i(p_k, k) = 1 if and only if (p_k, k) belongs to the i-th grid cell, and φ_j(p_k, k) = 0 for j ≠ i. In our experiments we used M = 25 grid cells. The stopping policy is q_θ(0 | p_k, k) = g(θ^T φ(p_k, k)), where g(x) = 1/(1 + e^{-x}) is the sigmoid function. The gradient of this policy has a simple form: \nabla_\theta \ln q_\theta(\lambda | p_k, k) = (1 - \lambda - q_\theta(0 | p_k, k))\, \phi(p_k, k) for λ ∈ {0, 1}. We have to wait τ_0 = 0.1 time units before observing each Xt. The thinking time for each subsequent stage k is τ_k = 0.002 + 0.2495 k; note that the thinking time at stage k = 4 is 1.

First, we generated 1,000,000 random parameter vectors. Each vector represents a policy. We ran each policy on 100 samples (e.g., envelopes) and computed the average payoff of the policy over these samples. The highest, lowest and average performances of these policies were 6.85, 0.34 and 1.60, respectively. Next, we generated 1,000 initial parameter vectors randomly and evaluated them. The performance histogram of these parameters is denoted by BG (Before Gradient) in Figure 2. For each of these parameter values, we ran our simulator for 1,000 timesteps. In each timestep, the DGA algorithm uses n = 10 trajectories to approximate its gradient. (We decrease the learning rate according to α_t = 5/t in each run.) Finally, after applying 1,000 gradient steps, we evaluated the parameters again. The performance histogram of the resulting parameters is denoted by AG (After Gradient) in Figure 2. The figure shows that DGA improves the policies considerably.

[Figure 2 (histogram; x-axis: Performance, y-axis: Frequency). The performance histogram of a set of parameters before (BG) and after (AG) applying the DGA method.]
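For concreteness, the tabular sigmoid policy of the toy problem, together with the grad-log-policy terms needed by DGA, can be sketched as follows. The grid layout and names below are our own choices for illustration (e.g., 5 probability bins per stage rather than the paper's M = 25 cells); only the formulas q_θ(0|p_k, k) = g(θ^T φ(p_k, k)) and ∇_θ ln q_θ(λ|p_k, k) = (1 - λ - q_θ(0|p_k, k)) φ(p_k, k) come from the text.

    import numpy as np

    K, BINS = 4, 5                # 4 stages, 5 bins for p in [0,1] -> M = 20 cells in this sketch
    M = K * BINS

    def phi(p, k):
        """One-hot feature for the grid cell containing (p, k)."""
        i = min(int(p * BINS), BINS - 1) + (k - 1) * BINS
        f = np.zeros(M)
        f[i] = 1.0
        return f

    def q_continue(theta, p, k):
        """q_theta(0 | p, k) = sigmoid(theta^T phi(p, k)): probability of continuing to think."""
        return 1.0 / (1.0 + np.exp(-theta @ phi(p, k)))

    def grad_log_q(theta, p, k, lam):
        """grad_theta ln q_theta(lam | p, k) = (1 - lam - q_theta(0 | p, k)) * phi(p, k)."""
        return (1.0 - lam - q_continue(theta, p, k)) * phi(p, k)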
3.5. Face Detection

In this section we demonstrate how our approach can be used in face detection applications. We ran our experiments on the face database of Viola and Jones (2001), which contains 4,916 facial and 7,872 non-facial gray-scale images of size 24 × 24. We also used the 22-stage hierarchical face classifier of Lienhart et al. (2003), an implementation of the AdaBoost-based cascade of Viola and Jones (2001), downloaded from Bradski and Kaehler (2008); we call this classification algorithm VJ. Each stage of VJ can classify images; while the higher-level classifiers perform better, they are also more costly, because the higher-level stages require more feature evaluations.

In the original version of VJ, this 22-stage classifier contains 22 threshold parameters (β_k ∈ R, k = 1, ..., 22). If β_k is smaller than the response of stage k (the weighted sum of the features of stage k), then the algorithm proceeds to stage k + 1, where it runs a more complex classifier; otherwise it classifies the image as a non-face. If the algorithm reaches stage 22 and the response of this stage exceeds β_22, then the image is classified as a face. Each stage gets its input from the previous stage, and the parameter β_k is set to the value at which the stage classifier recognizes 99.5% of faces and rejects about 30% of the non-faces. The problem with this approach is that the algorithm cannot classify an image as a face before reaching stage 22; the process is not allowed to quit earlier with a positive result. Also, the fixed 0.995 true positive rate (TPR) at each stage seems ad hoc, and, as we will see, these parameters can be optimized to achieve better results.

To address these problems, we made the following modifications. At each stage we introduced two parameters, α_k < β_k, to define the stopping and classification policy. Let Y_{t,k} denote the response of stage k on the t-th instance. If Y_{t,k} < α_k at stage k, then we quit thinking and decide that the object is not a face. If α_k < Y_{t,k} < β_k, then we keep thinking and proceed to stage k + 1. Finally, if β_k < Y_{t,k}, then we quit thinking and classify the object as a face. In practice, however, we used soft versions of these hard decision functions: we used the transition function q_{θ,k}(0 | Y_{t,k}) = 1/(1 + exp(-c(Y_{t,k} - α_k))) × 1/(1 + exp(-c(β_k - Y_{t,k}))) with c = 50, and a similar function for computing µ. The algorithm was forced to quit at stage 22, where the classification was based on the parameter β_22. While our method is similar to Wald's sequential probability ratio test (Wald & Wolfowitz, 1948), note that we tune the parameters α_k and β_k, we do not know the exact likelihood functions, and we apply soft decisions only.

The reward after each decision was determined by a reward matrix R ∈ R^{2×2}. In the following experiment we set R_11 = R_22 = 100 and R_12 = R_21 = 0, corresponding respectively to true positive, true negative, false positive and false negative results. To reduce the chance of being trapped in a local optimum, we combined the gradient ascent approach with the Cross-Entropy (CE) method (Rubinstein & Kroese, 2004): we generated 100 random initial parameter vectors from a normal distribution N(µ_0, Σ_0), evaluated their performances, and selected the best 10 -- i.e., the "elite" samples. From each of these elite samples we moved in the direction of the respective gradient with a fixed step size. After calculating the empirical mean and variance of these new, improved samples (denoted µ̄_0, Σ̄_0), we updated the internal parameters of the CE method -- µ_1 = λ µ̄_0 + (1 - λ) µ_0, Σ_1 = λ Σ̄_0 + (1 - λ) Σ_0 -- and then repeated the whole CE process from the beginning: again generating 100 new samples from the distribution N(µ_1, Σ_1), and so on. We used 25 CE iterations and set λ to 0.9. As this method combines CE with local search in the direction of the gradients, we call it CE-DGA. We assumed a fixed minimal thinking time τ_0 for each instance.
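The CE-DGA loop just described can be summarized in a short sketch. This is a schematic reading of the text, not the authors' code: evaluate_policy and estimate_gradient stand in for policy evaluation and the DGA gradient estimate, a diagonal-covariance sampling distribution is assumed, and the population size (100), elite size (10), smoothing weight (0.9) and 25 iterations follow the numbers reported above.

    import numpy as np

    def ce_dga(evaluate_policy, estimate_gradient, dim, rng,
               n_pop=100, n_elite=10, n_iter=25, lam=0.9, step=0.1):
        """Cross-Entropy search with a local gradient step applied to the elite samples (CE-DGA)."""
        mu, sigma = np.zeros(dim), np.ones(dim)      # initial sampling distribution N(mu, diag(sigma^2))
        for _ in range(n_iter):
            pop = mu + sigma * rng.standard_normal((n_pop, dim))        # sample candidate parameters
            scores = np.array([evaluate_policy(theta) for theta in pop])
            elite = pop[np.argsort(scores)[-n_elite:]]                  # keep the best candidates
            # local search: move each elite sample along its estimated gradient
            elite = np.array([theta + step * estimate_gradient(theta) for theta in elite])
            # smoothed CE update of the sampling distribution
            mu = lam * elite.mean(axis=0) + (1 - lam) * mu
            sigma = lam * elite.std(axis=0) + (1 - lam) * sigma
        return mu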
We tested our method with several reward matrices R and minimal thinking times τ_0, and found empirically that our method always performed better than purely randomly selected parameters. However, when τ_0 is set to a large value (40,000 in our experiments), the actual thinking time no longer matters, as the algorithm then only has to maximize its expected reward, which is proportional to TPR + (1 - FPR), where FPR is the false positive rate. In this regime we can compare our performance directly with VJ. The following results show that our proposed algorithm actually outperforms VJ here, even when the thinking time is not crucial.

For the experiments, we selected the first 1,000 facial and 1,000 non-facial images from the VJ database. Table 1 shows the empirical average reward, the average number of stages needed for classification, and the TPR and FPR values for the VJ algorithm and for the parameters optimized by the proposed CE-DGA algorithm. It also shows the average performance of 100 randomly selected parameter vectors. It is notable that CE-DGA achieves a higher expected reward, a higher TPR, and a smaller FPR than the VJ parameters, while using many fewer classification stages (6.1 instead of 13.17 on average).

Table 1. Expected reward, expected number of stages, true positive rate (TPR) and false positive rate (FPR) for the Viola-Jones parameters (VJ), random parameters, and our method (CE-DGA).

               VJ        Random    CE-DGA
    E[R]       96.95     76.80     97.70
    E[stage]   13.17     1.25      6.1
    TPR        96.70%    99.20%    97.00%
    FPR        2.80%     45.60%    1.60%

Figure 3 shows the performances of several randomly selected parameter vectors, as well as the performances of the VJ and CE-DGA methods, in the "ROC domain" -- i.e., the true positive rate × false positive rate coordinate system.

[Figure 3. The ROC domain. The performances of randomly drawn parameters, the Viola-Jones parameters (VJ), and the parameters optimized by the CE-DGA method, shown in the true positive rate × false positive rate coordinate system.]

One could also use CE alone, i.e., without local search in every iteration. Nonetheless, Figure 4 demonstrates that by also using the gradients we can achieve better performance in the first few CE iterations. Later, as the iterations proceed, the objective becomes flat and the gradient cannot help anymore. To have a fair comparison between CE and CE-DGA, we allowed the CE method to draw, in each iteration, as many samples as CE-DGA evaluated in that iteration.

[Figure 4. CE vs. CE-DGA (E[R] as a function of the CE iteration). With the combination of gradient ascent and CE (CE-DGA), in the first few iterations we can achieve higher performance than using CE only. The curves are averaged over 10 experiments; the error bars indicate the standard deviations.]

4. Discussion and Conclusion

This article considered only the trade-off between the accuracy of a performance system and its computational time cost. It is straightforward to generalize the approach when other costs are also incurred (e.g., material costs in a physical process) while processing the instances. We considered learning the best parameters only for the stopping policy q = (q_1, ..., q_K), which governs the time allowed for processing the input instance. This was only to simplify the presentation of the task: the techniques developed here can also be used to optimize the "action policy" -- to determine both the final decision policy µ = (µ_1, ..., µ_K) and the parameters of the actual sequence of "subroutines" A_i being followed.
Contributions: This article investigated when to stop thinking about specific tasks, when we have to consider both the quality of our solution to the task and the cost of thinking. We proposed an algorithm for this problem. To reduce the chance of being caught in local optima, we extended the gradient ascent method with Cross-Entropy optimization. We theoretically examined how far the estimated gradient can be from the true gradient, and provided experiments on a real-world face detection task. We found that on this task our proposed CE-DGA method could achieve a higher true positive rate and a smaller false positive rate than VJ with its standard weights, while using many fewer classification stages. We also note that our gradient estimates can be biased and hence can indeed point in the "wrong" direction. We therefore provided a stopping rule (Theorem 2) that guarantees (with high probability) that the estimated gradient does not point in the wrong direction.

Acknowledgements

The motivation for this paper comes from a discussion with Matt Ginsberg. We gratefully acknowledge support from the Alberta Ingenuity Centre for Machine Learning, the Alberta Ingenuity Fund, iCORE and NSERC. Csaba Szepesvári is on leave from MTA SZTAKI, Budapest, Hungary.

References

Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319-350.

Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Press.

Greensmith, E., Bartlett, P., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471-1530.

Greiner, R., Grove, A., & Roth, D. (2002). Learning cost-sensitive active classifiers. Artificial Intelligence, 139, 137-174.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30.

Lienhart, R., Kuranov, A., & Pisarevsky, V. (2003). Empirical analysis of detection cascades of boosted classifiers for rapid object detection. DAGM'03, 25th Pattern Recognition Symposium (pp. 297-304).

Mnih, V., Szepesvári, C., & Audibert, J.-Y. (2008). Empirical Bernstein stopping. International Conference on Machine Learning (pp. 672-679).

Rubinstein, R. Y., & Kroese, D. P. (2004). The cross-entropy method. Information Science and Statistics. Springer.

Russell, S., & Subramanian, D. (1995). Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2, 575-609.

Shiryaev, A. (2007). Optimal stopping rules. Springer.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. MIT Press.

Turney, P. (2000). Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning (International Conference on Machine Learning).

Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition (pp. 511-518).

Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. Annals of Mathematical Statistics, 19, 326-339.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

Zilberstein, S. (1996). Using anytime algorithms in intelligent systems. AI Magazine, 17, 73-83.