Combining Expert Advice Efficiently Wouter M. Koolen and Steven de Rooij Centrum voor Wiskunde en Informatica (CWI) Kruislaan 413, P.O. Box 94079 1090 GB Amsterdam, The Netherlands {W.M.Koolen-Wijkstra,S.de.Rooij}@cwi.nl

Abstract We show how models for prediction with expert advice can be defined concisely and clearly using hidden Markov models (HMMs); standard HMM algorithms can then be used to efficiently calculate how the expert predictions should be weighted according to the model. We cast many existing models as HMMs and recover the best known running times in each case. We also describe two new models: the switch distribution, which was recently developed to improve Bayesian/Minimum Description Length model selection, and a new generalisation of the fixed share algorithm based on run-length coding. We give loss bounds for all models and shed new light on the relationships between them.

1 Introduction We cannot predict exactly how complicated processes such as the weather, the stock market, social interactions and so on, will develop into the future. Nevertheless, people do make weather forecasts and buy shares all the time. Such predictions can be based on formal models, or on human expertise or intuition. An investment company may even want to choose between portfolios on the basis of a combination of these kinds of predictors. In such scenarios, predictors typically cannot be considered "true". Thus, we may well end up in a position where we have a whole collection of prediction strategies, or experts, each of whom has some insight into some aspects of the process of interest. We address the question how a given set of experts can be combined into a single predictive strategy that is as good as, or if possible even better than, the best individual expert. The setup is as follows. Let Ξ be a finite set of experts. Each expert ξ ∈ Ξ issues a distribution P_ξ(x_{n+1} | x^n) on the next outcome x_{n+1} given the previous observations x^n := x_1, . . . , x_n.
Here, each outcome x_i is an element of some countable space X, and random variables are written in bold face. The probability that an expert assigns to a sequence of outcomes is given by the chain rule: P_ξ(x^n) = P_ξ(x_1) · P_ξ(x_2 | x^1) · … · P_ξ(x_n | x^{n−1}). A standard Bayesian approach to combine the expert predictions is to define a prior w on the experts Ξ, which induces a joint distribution with mass function P(x^n, ξ) = w(ξ)P_ξ(x^n). Inference is then based on this joint distribution. We can compute, for example: (a) the marginal probability of the data P(x^n) = Σ_{ξ∈Ξ} w(ξ)P_ξ(x^n), (b) the predictive distribution on the next outcome P(x_{n+1} | x^n) = P(x^n, x_{n+1})/P(x^n), which defines a prediction strategy that combines those of the individual experts, or (c) the posterior distribution on the experts P(ξ | x^n) = P_ξ(x^n)w(ξ)/P(x^n), which tells us how the experts' predictions should be weighted. This simple probabilistic approach has the advantage that it is computationally easy: predicting n outcomes using |Ξ| experts requires only O(n · |Ξ|) time. Additionally, this Bayesian strategy guarantees that the overall probability of the data is only a factor w(ξ̂) smaller than the probability of the data according to the best available expert ξ̂. On the flip side, with this strategy we never do any better than ξ̂ either: we have P_ξ̂(x^n) ≥ P(x^n) ≥ P_ξ̂(x^n)w(ξ̂), which means that potentially valuable insights from the other experts are not used to our advantage! More sophisticated combinations of prediction strategies can be found in the literature under various headings, including (Bayesian) statistics, source coding and universal prediction. In the latter the experts' predictions are not necessarily probabilistic, and are scored using an arbitrary loss function. In this paper we consider only logarithmic loss, although our results can probably be generalised to the framework described in, e.g. [12]. The three main contributions of this paper are the following.
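To make the Bayesian strategy above concrete, here is a minimal Python sketch. The two toy Bernoulli experts, the function names, and the data are our own illustrative assumptions, not taken from the paper; the computation itself is just the marginal (a) and posterior (c) above, in log space.

```python
import math

def bayes_mix(experts, weights, xs):
    # experts: list of forecasting systems, each a function past -> P(next = 1)
    # weights: prior w(xi) over the experts
    # Returns (log P(x^n), posterior P(xi | x^n)) for a binary sequence xs.
    log_joint = [math.log(w) for w in weights]  # accumulates log w(xi) + log P_xi(x^n)
    for i, x in enumerate(xs):
        for k, expert in enumerate(experts):
            p_one = expert(xs[:i])
            log_joint[k] += math.log(p_one if x == 1 else 1 - p_one)
    top = max(log_joint)
    log_marginal = top + math.log(sum(math.exp(l - top) for l in log_joint))
    posterior = [math.exp(l - log_marginal) for l in log_joint]
    return log_marginal, posterior

# Two hypothetical experts: a confident one (always P(1) = 0.9) and a uniform one.
optimist = lambda past: 0.9
uniform = lambda past: 0.5
logp, post = bayes_mix([optimist, uniform], [0.5, 0.5], [1, 1, 1, 0, 1, 1])
```

One can check on such examples that log P(x^n) lies between log P_ξ̂(x^n) + log w(ξ̂) and log P_ξ̂(x^n), exactly the sandwich stated in the text.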
First, we introduce prior distributions on sequences of experts, which allows a unified description of many existing models. Second, we show how HMMs can be used as an intuitive graphical language to describe such priors and obtain computationally efficient prediction strategies. Third, we use this new approach to describe and analyse several important existing models, as well as one recent and one completely new model for expert tracking. 1.1 Overview In §2 we develop a new, more general framework for combining expert predictions, where we consider the possibility that the optimal weights used to mix the expert predictions may vary over time, i.e. as the sample size increases. We stick to Bayesian methodology, but we define the prior distribution as a probability measure on sequences of experts rather than on experts. The prior probability of a sequence ξ_1, ξ_2, … is the probability that we rely on expert ξ_1's prediction of the first outcome and expert ξ_2's prediction of the second outcome, etc. This allows for the expression of more sophisticated models for the combination of expert predictions. For example, the nature of the data generating process may evolve over time; consequently different experts may be better during different periods of time. It is also possible that not the data generating process, but the experts themselves change as more and more outcomes are being observed: they may learn from past mistakes, possibly at different rates, or they may have occasional bad days, etc. In both situations we may hope to benefit from more sophisticated modelling. Of course, not all models for combining expert predictions are computationally feasible. §3 describes a methodology for the specification of models that allow efficient evaluation. We achieve this by using hidden Markov models (HMMs) on two levels. On the first level, we use an HMM as a formal specification of a distribution on sequences of experts as defined in §2.
We introduce a graphical language to conveniently represent its structure. These graphs help to understand and compare existing models and to design new ones. We then modify this first HMM to construct a second HMM that specifies the distribution on sequences of outcomes. Subsequently, we can use the standard dynamic programming algorithms for HMMs (forward, backward and Viterbi) on both levels to efficiently calculate most relevant quantities, most importantly the marginal probability of the observed outcomes P(x^n) and the posterior weights on the next expert given the previous observations P(ξ_{n+1} | x^n). It turns out that many existing models for prediction with expert advice can be specified as HMMs. We provide an overview in §4 by giving the graphical representations of the HMMs corresponding to the following three models. First, universal elementwise mixtures (sometimes called mixture models) that learn the optimal mixture parameter from data. Second, Herbster and Warmuth's fixed share algorithm for tracking the best expert [4, 5]. Third, universal share, which was introduced by Volf and Willems as the switching method [11] and later independently proposed by Bousquet [1]. Here the goal is to learn the optimal fixed-share parameter from data. We render each model as a prior on sequences of experts by giving its HMM. The size of the HMM immediately determines the required running time for the forward algorithm. The generalisation relationships between these models as well as their running times are displayed in Figure 1. In each case this running time coincides with that of the best known algorithm. We also give a loss bound for each model, relating the loss of the model to the loss of the best competitor among a set of alternatives in the worst case. Such loss bounds can help select between different models for specific prediction tasks.
Besides the models found in the literature, Figure 1 also includes two new generalisations of fixed share: the switch distribution and the run-length model. These models are the subject of §5. The switch distribution was introduced in [10] as a practical means of improving Bayes/Minimum Description Length prediction to achieve the optimal rate of convergence in nonparametric settings. Here we give the concrete HMM that allows for its linear time computation. The run-length model is based on a distribution on the number of successive outcomes that are typically well-predicted by the same expert. Run-length codes are typically applied directly to the data, but in our novel application they define the prior on expert sequences instead. Again, we provide the graphical representation of their defining HMMs as well as loss bounds. We conclude by comparing the two models.

[Figure 1: Expert sequence priors: generalisation relationships and running times. O(n): fixed expert. O(n|Ξ|): fixed elementwise mixture, Bayesian mixture, fixed share, fixed overconfident experts, switch distribution. O(n²|Ξ|): universal share, run-length model, universal overconfident experts. O(n^|Ξ|): universal elementwise mixture.]

2 Expert Sequence Priors In this section we explain how expert tracking can be described in probability theory using expert sequence priors (ES-priors). These ES-priors are distributions on the space of infinite sequences of experts that are used to express regularities in the development of the relative quality of the experts' predictions. As illustrations we render Bayesian mixtures and elementwise mixtures as ES-priors. In the next section we show how ES-priors can be implemented efficiently by hidden Markov models.

Notation We denote by N the natural numbers including zero, and by Z+ the natural numbers excluding zero. Let Q be a set. We denote the cardinality of Q by |Q|. For any natural number n, we let the variable q^n range over the n-fold Cartesian product Q^n, and we write q^n = ⟨q_1, …, q_n⟩. We also let q^ω range over Q^ω — the set of infinite sequences over Q — and write q^ω = ⟨q_1, …⟩. We read the statement q^λ ∈ Q^{≤ω} to first bind λ ≤ ω and subsequently q^λ ∈ Q^λ. If q^λ is a sequence, and κ ≤ λ, then we denote by q^κ the prefix of q^λ of length κ.

Forecasting System Let X be a countable outcome space. We use the notation X* for the set of all finite sequences over X and let ∆(X) denote the set of all probability mass functions on X. A (prequential) X-forecasting system (PFS) is a function P : X* → ∆(X) that maps sequences of previous observations to a predictive distribution on the next outcome. Prequential forecasting systems were introduced by Dawid in [2].

Distributions We also use probability measures on spaces of infinite sequences. In such a space, a basic event is the set of all continuations of a given prefix. We identify such events with their prefix. Thus a distribution on X^ω is defined by a function P : X* → [0, 1] that satisfies P(ε) = 1, where ε is the empty sequence, and for all n ≥ 0, all x^n ∈ X^n, we have Σ_{x∈X} P(x_1, …, x_n, x) = P(x^n). We identify P with the distribution it defines. We write P(x^n | x^m) for P(x^n)/P(x^m) if 0 ≤ m ≤ n. Note that forecasting systems continue to make predictions even after they have assigned probability 0 to a previous outcome, while distributions' predictions become undefined. Nonetheless we use the same notation: we write P(x_{n+1} | x^n) for the probability that a forecasting system P assigns to the n+1st outcome given the first n outcomes, as if P were a distribution.

ES-Priors The slogan of this paper is: we do not understand the data. Instead of modelling the data, we work with experts. We assume that there is a fixed set Ξ of experts, and that each expert ξ ∈ Ξ predicts using a forecasting system P_ξ. We are interested in switching between different forecasting systems at different sample sizes.
For a sequence of experts with prefix ξ^n, the combined forecast, where expert ξ_i predicts the ith outcome, is denoted

P_{ξ^n}(x^n) := ∏_{i≤n} P_{ξ_i}(x_i | x^{i−1}).

At each moment in time we predict the data using the posterior, which is a mixture over our experts' predictions. Ideally, the ES-prior should be chosen such that the posterior coincides with the optimal mixture weights of the experts at each sample size. The traditional interpretation of our ES-prior as a representation of belief about an unknown "true" expert sequence is tenuous, as normally experts do not generate data, they only predict it. Moreover, by mixing different expert sequences, it is often possible to predict significantly better than by using any single sequence of experts, a feature that is crucial to the performance of many of the models that will be described below and in §4. In the remainder of this paper we motivate ES-priors by giving performance guarantees in the form of bounds on running time and loss.

2.1 Examples We now show how two ubiquitous models can be rendered as ES-priors.

Example 2.1.1 (Bayesian Mixtures). Let Ξ be a set of experts, and let P_ξ be a PFS for each ξ ∈ Ξ. Suppose that we do not know which expert will make the best predictions. Following the usual Bayesian methodology, we combine their predictions by conceiving a prior w on Ξ, which (depending on the adhered philosophy) may or may not be interpreted as an expression of one's beliefs in this respect. Then the standard Bayesian mixture P_bayes is given by

P_bayes(x^n) = Σ_{ξ∈Ξ} P_ξ(x^n) w(ξ). (4)

We adopt shorthand notation for events: we write P(S), where S is a subsequence of ξ^n and/or of x^n, for the probability under P of the set of sequences of pairs which match S exactly. For example, the marginal probability of a sequence of outcomes is:

P(x^n) = Σ_{ξ^n} P(ξ^n, x^n). (2)

Adopting Bayesian methodology, we impose a prior π on infinite sequences of experts; this prior is called an expert sequence prior (ES-prior).
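The marginalisation over expert sequences can be made concrete with a brute-force sketch. The toy experts, names and data below are our own illustrative assumptions; the loop enumerates all |Ξ|^n sequences, which is exponential in n and only for illustration (the point of the paper is that HMM structure avoids this).

```python
from itertools import product

def es_joint_marginal(experts, es_prior, xs):
    # Brute-force P(x^n) = sum over expert sequences xi^n of
    # pi(xi^n) * prod_i P_{xi_i}(x_i | x^{i-1}), for binary outcomes xs.
    total = 0.0
    for seq in product(range(len(experts)), repeat=len(xs)):
        p = es_prior(seq)
        if p == 0.0:
            continue
        for i, k in enumerate(seq):
            p_one = experts[k](xs[:i])
            p *= p_one if xs[i] == 1 else 1 - p_one
        total += p
    return total

# The Bayesian ES-prior of Example 2.1.1: all mass on constant sequences,
# here with a uniform w over two experts.
def bayes_prior(seq):
    return 0.5 if len(set(seq)) <= 1 else 0.0

experts = [lambda past: 0.9, lambda past: 0.5]
p = es_joint_marginal(experts, bayes_prior, [1, 0, 1])
```

With this ES-prior the brute-force sum collapses to the plain Bayesian mixture (4), which is a useful sanity check.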
Inference is then based on the distribution on the joint space (Ξ × X)^ω, called the ES-joint, which is defined as follows:

P(⟨ξ_1, x_1, …, ξ_n, x_n⟩) := π(ξ^n) P_{ξ^n}(x^n). (1)

Recall that P_{ξ^n}(x^n) means ∏_{i=1}^n P_{ξ_i}(x_i | x^{i−1}). The Bayesian mixture is not an ES-joint, but it can easily be transformed into one by using the ES-prior that assigns probability w(k) to the identically-k sequence for each k ∈ Ξ:

π_bayes(ξ^n) = w(k) if ξ_i = k for all i = 1, …, n, and 0 otherwise.

We will use the adjective "Bayesian" generously throughout this paper, but when we write the standard Bayesian ES-prior this always refers to π_bayes. ◊

Example 2.1.2 (Elementwise Mixtures). The elementwise mixture¹ is formed from some mixture weights α ∈ ∆(Ξ) by

P_mix,α(x^n) := ∏_{i≤n} Σ_{ξ∈Ξ} P_ξ(x_i | x^{i−1}) α(ξ).

¹These mixtures are sometimes just called mixtures, or predictive mixtures. We use the term elementwise mixtures both for descriptive clarity and to avoid confusion with Bayesian mixtures.

Compare this to the usual Bayesian statistics, where a model class {P_θ | θ ∈ Θ} is also endowed with a prior distribution w on Θ. Then, after observing outcomes x^n, inference is based on the posterior P(θ | x^n) on the parameter, which is never actually observed. Our approach is exactly the same, but we always consider Θ = Ξ^ω. Thus as usual our predictions are based on the posterior P(ξ^n | x^n). However, since the predictive distribution of x_{n+1} only depends on ξ_{n+1} (and x^n) we always marginalise as follows:

P(ξ_{n+1} | x^n) = Σ_{ξ^n} P(ξ^n, x^n) · π(ξ_{n+1} | ξ^n) / Σ_{ξ^n} P(ξ^n, x^n). (3)

In the preceding definition, it may seem that elementwise mixtures do not fit in the framework of ES-priors. But we can rewrite this definition in the required form as follows:

P_mix,α(x^n) = ∏_{i≤n} Σ_{ξ∈Ξ} P_ξ(x_i | x^{i−1}) α(ξ) = Σ_{ξ^n} P_{ξ^n}(x^n) π_mix,α(ξ^n), (5a)

which is the ES-joint based on the prior

π_mix,α(ξ^n) := ∏_{i≤n} α(ξ_i). (5b)

Thus, the ES-prior for elementwise mixtures is just the product distribution of α. ◊

We mentioned above that ES-priors cannot be interpreted as expressions of belief about individual expert sequences. This is a prime example, as the ES-prior is crafted such that its posterior π_mix,α(ξ_{n+1} | ξ^n) exactly coincides with the desired mixture of experts.

3 Expert Tracking using HMMs We explained in the previous section how expert tracking can be implemented using expert sequence priors. In this section we specify ES-priors using hidden Markov models (HMMs). The advantage of using HMMs is that the complexity of the resulting expert tracking procedure can be read off directly from the structure of the HMM. We first give a short overview of the particular kind of HMMs that we use throughout this paper. We then show how HMMs can be used to specify ES-priors. As illustrations we render the ES-priors that we obtained for Bayesian mixtures and elementwise mixtures in the previous sections as HMMs. In §4 we provide an overview of ES-priors and their defining HMMs that are found in the literature.

3.1 Hidden Markov Models Overview Hidden Markov models (HMMs) are a well-known tool for specifying probability distributions on sequences with temporal structure. Furthermore, these distributions are very appealing algorithmically: many important probabilities can be computed efficiently for HMMs. These properties make HMMs ideal models of expert sequences: ES-priors. For an introduction to HMMs, see [9]. We require a slightly more general notion that incorporates silent states and forecasting systems as explained below. We define our HMMs on a generic set of outcomes O to avoid confusion in later sections, where we use HMMs in two different contexts. First, in §3.2, we use HMMs to define ES-priors, and instantiate O with the set of experts Ξ. Then in §3.4 we modify the HMM that defines the ES-prior to incorporate the experts' predictions, whereupon O is instantiated with the set of observable outcomes X.

Definition 1. Let O be a finite set of outcomes. We call a quintuple A = ⟨Q, Qp, P∘, P, ⟨P_q⟩_{q∈Qp}⟩ a hidden Markov model on O if Q is a countable set, Qp ⊆ Q, P∘ ∈ ∆(Q), P : Q → ∆(Q) and P_q is an O-forecasting system for each q ∈ Qp.

Terminology and Notation We call the elements of Q states. We call the states in Qp productive and the other states silent. We call P∘ the initial distribution, let I denote its support (i.e. I := {q ∈ Q | P∘(q) > 0}) and call I the set of initial states. We call P the stochastic transition function. We let S_q denote the support of P(q), and call q′ ∈ S_q a direct successor of q. We abbreviate P(q)(q′) to P(q → q′). A finite or infinite sequence of states q^λ ∈ Q^{≤ω} is called a branch through A. A branch q^λ is called a run if either λ = 0 (so q^λ = ε), or q_1 ∈ I and q_{i+1} ∈ S_{q_i} for all 1 ≤ i < λ. A finite run q^n ≠ ε is called a run to q_n. For each branch q^λ, we denote by q^λp its subsequence of productive states. We denote the elements of q^λp by q^p_1, q^p_2, etc. We call an HMM continuous if q^ωp is infinite for each infinite run q^ω.

Restriction In this paper we will only work with continuous HMMs. This restriction is necessary for the following to be well-defined.

Definition 2. An HMM A defines the following distribution on sequences of states: π_A(ε) := 1, and for λ ≥ 1,

π_A(q^λ) := P∘(q_1) ∏_{i=1}^{λ−1} P(q_i → q_{i+1}).

Then via the PFSs, A induces the joint distribution P_A on runs and sequences of outcomes. Let o^n ∈ O^n be a sequence of outcomes and let q^λ ≠ ε be a run with at least n productive states; then

P_A(o^n, q^λ) := π_A(q^λ) ∏_{i=1}^n P_{q^p_i}(o_i | o^{i−1}).

The value of P_A at arguments o^n, q^λ that do not fulfil the condition above is determined by the additivity axiom of probability.

The Forward Algorithm For a given HMM A and data o^n, the forward algorithm (cf. [9]) computes the marginal probability P_A(o^n). The forward algorithm operates by percolating weights along the transitions of the HMM. The running time is proportional to the number of transitions that need to be considered. Details can be found in [6].
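The weight-percolating forward pass just described can be sketched as follows. This is our own minimal rendition for unfolded HMMs with silent states (silent states must form a DAG, as they do in unfolded HMMs); the data structures and names are assumptions, not the paper's notation.

```python
def forward(init, trans, emit, obs_prob, xs):
    # init: state -> initial probability; trans: state -> list of
    # (successor, probability); emit: productive state -> outcome label
    # (silent states are absent from emit); obs_prob: (label, i, xs) ->
    # probability the corresponding forecasting system gives to xs[i].
    # Assumes every productive state has successors summing to one.
    def settle(w):
        # Percolate weight out of silent states until only productive
        # states carry mass.
        pending = [q for q in w if q not in emit]
        while pending:
            q = pending.pop()
            mass = w.pop(q, 0.0)
            if mass == 0.0:
                continue
            for r, p in trans.get(q, []):
                w[r] = w.get(r, 0.0) + mass * p
                if r not in emit:
                    pending.append(r)
        return w

    w = settle(dict(init))
    for i in range(len(xs)):
        nxt = {}
        for q, mass in w.items():
            mass *= obs_prob(emit[q], i, xs)  # weigh in the ith observation
            for r, p in trans.get(q, []):
                nxt[r] = nxt.get(r, 0.0) + mass * p
        w = settle(nxt)
    return sum(w.values())  # marginal probability P_A(x^n)
```

Each transition is touched once per step, so the running time is proportional to the number of transitions in the unfolded HMM, as stated above.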
In this paper we present all HMMs unfolded, so that each transition needs to be considered exactly once, and hence the running time can be read off easily.

3.2 HMMs as ES-Priors In applications HMMs are often used to model data. This is often useful if there are local correlations between outcomes. A graphical model depicting this approach is displayed in Figure 2a. In this paper we use HMMs as ES-priors, that is, to specify temporal correlations between the performance of our experts. Thus instead of concrete observations our HMMs will "produce" sequences of experts that are never actually observed. Figure 2b illustrates this approach. Using HMMs as priors allows us to use the standard algorithms for HMMs to answer questions about the prior. For example, we can use the forward algorithm to compute the prior probability of the sequence of one hundred experts with expert number one at all odd indices and expert number two at all even indices. However, we are obviously also interested in questions about the data rather than about the prior. In §3.4 we show how joints based on HMM priors (Figure 2c) can be transformed into ordinary HMMs (Figure 2a) with at most a |Ξ|-fold increase in size, allowing us to use the standard algorithms for HMMs not only for the experts, but for the data as well, with the same increase in complexity. This is the best we can generally hope for, as we now need to integrate over all possible expert sequences instead of considering only a single one. Here we first consider properties of HMMs that represent ES-priors.

Restriction HMM priors "generate", or define the distribution on, sequences of experts. But contrary to the data, which are observed, no concrete sequence of experts is realised. This means that we cannot conveniently condition the distribution on experts in a productive state q^p_n on the sequence of previously produced experts ξ^{n−1}.
In other words, we can only use an HMM on Ξ as an ES-prior if the forecasting systems in its states are simply distributions, so that all dependencies between consecutive experts are carried by the state. This is necessary to avoid having to sum over all (exponentially many) possible expert sequences.

Deterministic Under the restriction above, but in the presence of silent states, we can make any HMM deterministic in the sense that each forecasting system assigns probability one to a single outcome. We just replace each productive state q ∈ Qp by a gadget in which a silent state fans out into one deterministic productive state per outcome. In the left diagram of the gadget, the state q has distribution P_q on outcomes O = {A, …, E}. In the right diagram, the leftmost silent state has transition probability P_q(o) to a state that deterministically outputs outcome o. We often make the functional relationship explicit by calling ⟨Q, Qp, P∘, P, Λ⟩ a deterministic HMM on O if Λ : Qp → O. Here we slightly abuse notation; the last component of a (general) HMM assigns a PFS to each productive state, while the last component of a deterministic HMM assigns an outcome to each productive state. Sequential prediction using a general HMM or its deterministic counterpart costs the same amount of work: the |O|-fold increase in the number of states is compensated by the |O|-fold reduction in the number of outcomes that need to be considered per state.

Diagrams Deterministic HMMs can be graphically represented by pictures. In general, we draw a node N_q for each state q. We draw a small black dot for a silent state, and an ellipse labelled Λ(q) for a productive state. We draw an arrow from N_q to N_{q′} if q′ is a direct successor of q. We often reify the initial distribution P∘ by including a virtual node, drawn as an open circle, with an outgoing arrow to N_q for each initial state q ∈ I. The transition probability P(q → q′) is not displayed in the graph.

3.3 Examples We are now ready to give the deterministic HMMs that correspond to the ES-priors of our earlier examples from §2.1: Bayesian mixtures and elementwise mixtures with fixed parameters.

Example 3.3.1 (HMM for Bayesian Mixtures). The Bayesian mixture ES-prior π_bayes as introduced in Example 2.1.1 represents the hypothesis that a single expert predicts best for all sample sizes. A simple deterministic HMM on Ξ that generates the prior π_bayes is given by A_bayes = ⟨Q, Qp, P∘, P, Λ⟩, where

Q = Qp = Ξ × Z+, Λ(ξ, n) = ξ, P∘(ξ, 1) = w(ξ), (6a)
P(⟨ξ, n⟩ → ⟨ξ, n+1⟩) = 1. (6b)

The diagram of (6) is displayed in Figure 3. [Figure 3: Standard Bayesian mixture.] From the picture of the HMM it is clear that it computes the Bayesian mixture. Hence, using (4), the loss of the HMM with prior w is bounded for all data x^n and all experts ξ by

−log P_{A_bayes}(x^n) + log P_ξ(x^n) ≤ −log w(ξ). (7)

In particular this bound holds for ξ̂ = argmax_ξ P_ξ(x^n), so we predict as well as the single best expert with constant overhead. Also P_{A_bayes}(x^n) can obviously be computed in O(n|Ξ|) using its definition (4). We show in [6] that computing it using the HMM prior above gives the same running time O(n|Ξ|), a perfect match. ◊

Example 3.3.2 (HMM for Elementwise Mixtures). We now present the deterministic HMM A_mix,α that implements the ES-prior π_mix,α of Example 2.1.2. Its diagram is displayed in Figure 4. The HMM has a single silent state per outcome, and its transition probabilities are the mixture weights α. Formally, A_mix,α is given using Q = Qs ∪ Qp by

Qs = {p} × N, Qp = Ξ × Z+, Λ(ξ, n) = ξ, P∘(p, 0) = 1, (8a)
P(⟨p, n⟩ → ⟨ξ, n+1⟩) = α(ξ), P(⟨ξ, n⟩ → ⟨p, n⟩) = 1. (8b)

The vector-style definition of P is shorthand for one P per line. We show in [6] that this HMM allows us to compute P_{A_mix,α}(x^n) in time O(n|Ξ|). ◊

[Figure 2: HMMs. (a) Standard use of an HMM; (b) HMM ES-prior; (c) application to data. q^p_i, ξ_i and x_i are the ith productive state, expert and observation.]
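The prediction strategy of Example 3.3.2 admits a one-pass sketch: because the silent state pools all weight after every outcome, the predictive distribution is always the same α-mixture. The toy setup below (names, Bernoulli experts) is our own illustration of that computation.

```python
import math

def fixed_mixture_log_prob(experts, alpha, xs):
    # Fixed elementwise mixture: every outcome is predicted by the
    # alpha-weighted mixture of the experts' predictions, so
    # P(x^n) = prod_i sum_xi alpha(xi) P_xi(x_i | x^{i-1}).
    # One pass, O(n * |experts|) time -- matching the size of the
    # unfolded HMM of Example 3.3.2.
    log_prob = 0.0
    for i, x in enumerate(xs):
        p_one = sum(a * e(xs[:i]) for a, e in zip(alpha, experts))
        log_prob += math.log(p_one if x == 1 else 1 - p_one)
    return log_prob
```

Contrast this with the Bayesian mixture of Example 3.3.1, where the per-expert weights evolve with the data instead of staying fixed at α.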
[Figure 4: Fixed elementwise mixture.]

3.4 The HMM for Data We obtain our model for the data (Figure 2c) by composing an HMM prior on Ξ^ω with a PFS P_ξ for each expert ξ ∈ Ξ. We now show that the resulting marginal distribution on data can be implemented by a single HMM on X (Figure 2a) with the same number of states as the HMM prior. Let P_ξ be an X-forecasting system for each ξ ∈ Ξ, and let the ES-prior π_A be given by the deterministic HMM A = ⟨Q, Qp, P∘, P, Λ⟩ on Ξ. Then the marginal distribution of the data (see (1)) is given by

P_A(x^n) = Σ_{ξ^n} π_A(ξ^n) ∏_{i=1}^n P_{ξ_i}(x_i | x^{i−1}).

The HMM X := ⟨Q, Qp, P∘, P, ⟨P_{Λ(q)}⟩_{q∈Qp}⟩ on X induces the same marginal distribution (see Definition 2). That is, P_X(x^n) = P_A(x^n). Moreover, X contains only the forecasting systems that also exist in A and it retains the structure of A. In particular this means that the algorithms for HMMs have the same running time on the prior A as on the marginal X.

4 Zoology Perhaps the simplest way to predict using a number of experts is to pick one of them and mirror her predictions exactly. Beyond this "fixed expert model", we have considered two methods of combining experts so far, namely taking Bayesian mixtures, and taking elementwise mixtures as described in §3.3. Figure 1 shows these and a number of other, more sophisticated methods that fit in our framework. The arrows indicate which methods are generalised by which other methods. They have been partitioned in groups that can be computed in the same amount of time using HMMs. We have presented two examples so far, the Bayesian mixture and the elementwise mixture with fixed coefficients (Examples 3.3.1 and 3.3.2). The latter model is parameterised. Choosing a fixed value for the parameter beforehand is often difficult. The first model we discuss learns the optimal parameter value on-line, at the cost of only a small additional loss. We then proceed to discuss a number of important existing expert models.

4.1 Universal Elementwise Mixtures A distribution is "universal" for a family of distributions if it incurs small additional loss compared to the best member of the family. A standard Bayesian mixture constitutes the simplest example. It is universal for the fixed expert model, where the unknown parameter is the used expert. For the uniform prior, the additional loss (7) is at most log |Ξ|. In Example 3.3.2, we described elementwise mixtures with fixed coefficients as ES-priors. Prior knowledge about the mixture coefficients is often unavailable. We now expand this model to learn the optimal mixture coefficients from the data, resulting in a distribution that is universal for the fixed elementwise mixtures. To this end we place a prior distribution w on the space of mixture weights ∆(Ξ). Using (5) we obtain the following marginal distribution:

P_umix(x^n) = ∫_{∆(Ξ)} P_mix,α(x^n) w(α) dα = ∫_{∆(Ξ)} Σ_{ξ^n} P_{ξ^n}(x^n) π_mix,α(ξ^n) w(α) dα = Σ_{ξ^n} P_{ξ^n}(x^n) π_umix(ξ^n), (9)

where π_umix(ξ^n) = ∫_{∆(Ξ)} π_mix,α(ξ^n) w(α) dα.

Thus P_umix is the ES-joint with ES-prior π_umix. This applies more generally: parameters α can be integrated out of an ES-prior regardless of which experts are used, since the expert predictions P_{ξ^n}(x^n) do not depend on α. We will proceed to calculate a loss bound for the universal elementwise mixture model, showing that it really is universal. After that we will describe how it can be implemented as an HMM.

4.1.1 A Loss Bound In this section we relate the loss of a universal elementwise mixture with the loss obtained by the maximum likelihood elementwise mixture. While mixture models occur regularly in the statistical literature, we are not aware of any appearance in universal prediction.
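For two experts, the universal elementwise mixture of §4.1 (with its count-based successor probabilities, as made explicit in §4.1.2) can be computed by dynamic programming over count vectors. The following is our own sketch for |Ξ| = 2 under the Jeffreys (Dirichlet(1/2, 1/2)) prior; experts and names are illustrative assumptions.

```python
def universal_mixture_two(experts, xs):
    # Universal elementwise mixture for two experts with the Jeffreys
    # prior, by dynamic programming over count vectors (O(n^2) states).
    # w[n1] is the joint probability of x^i and the event that expert 0
    # was used n1 times among the first i outcomes.
    w = {0: 1.0}
    for i, x in enumerate(xs):
        like = []
        for e in experts:
            p_one = e(xs[:i])
            like.append(p_one if x == 1 else 1 - p_one)
        nxt = {}
        for n1, mass in w.items():
            # Successor probability of expert 0 given the counts:
            # (1/2 + n1) / (|Xi|/2 + i) with |Xi| = 2.
            a0 = (0.5 + n1) / (1.0 + i)
            nxt[n1 + 1] = nxt.get(n1 + 1, 0.0) + mass * a0 * like[0]
            nxt[n1] = nxt.get(n1, 0.0) + mass * (1 - a0) * like[1]
        w = nxt
    return sum(w.values())
```

The counts are a sufficient statistic, so this quadratic-time recursion agrees exactly with the exponential-size sum over expert sequences in (9).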
Therefore, to the best of our knowledge, the following simple loss bound is new. Our goal is to obtain a bound in terms of properties of the prior. A difficulty here is that there are many expert sequences exhibiting mixture frequencies close to the maximum likelihood mixture weights, so that each individual expert sequence contributes relatively little to the total probability (9). The following theorem is a general tool to deal with such situations. Theorem 3. Let , be ES-priors s.t. is zero whenever is. Then for all xn , reading 0/0 = 0, P (xn ) ( n ) max . n) n P (x ( n ) Proof. Clearly P is zero whenever P is. Thus nn n P (x , ) P (xn , n ) P (xn ) max = nn n P (xn , n ) P (xn ) n P (x , ) = max n Figure 5 Universal elementwise mixture (two experts only) 0,3 A 0,2 A B 0,1 A B D,2 A 0,0 A B D,1 A B B D,0 A B C,1 A B C,0 A B B B,0 A B ( n ) Pn (xn )( n ) = max . n ) ( n ) n ( n ) Pn (x 4.1.2 HMM While universal elementwise mixtures can be described using the ES-prior umix defined in (9), unfortunately any HMM that computes it needs a state for each possible count vector, and is therefore huge if the number of experts is large. The HMM Aumix for an arbitrary number of experts using 1 the 2 , . . . , 1 Dirichlet prior is given using Q = Qs Qp 2 by Q s = N P Qp = N × P (0) = 1 (n, ) = n 1/2+n P n n, = ||/2+ n n (1 2 ) , + 1 1 Using this theorem, we obtain a loss bound for universal elementwise mixtures that can be computed prior to observation and without reference to the experts' PFSs. Corollary 4. Let Pumix be the universal elementwise mixture 1 model defined using the ( 1 , . . . , 2 )-Dirichlet prior (that is, 2 Jeffreys' prior) as the prior w() in (9). Let (xn ) maximise ^ the likelihood Pmix, (xn ) w.r.t. . Then for all xn the additional loss incurred by the universal elementwise mixture is bounded thus - log Pumix (xn ) + log Pmix,(xn ) (xn ) ^ for a fixed constant c. Proof. By Theorem 3 - log Pumix (xn ) + log Pmix,(xn ) (xn ) ^ - . 
max log umix ( n ) + log mix,(xn ) ( n ) (1 0 ) ^ n || - 1 n log + c 2 We now bound the right hand side. Let ( n ) maximise ^ mix, ( n ) w.r.t. . Then for all xn and n mix,(xn ) ( n ) mix,(n ) ( n ). ^ ^ 1 1 For the 2 , . . . , 2 Dirichlet prior, for all n - log umix ( n ) + log mix,(n ) ( n ) ^ (1 1 ) || - 1 n log + c 2 We write N for the set of assignments of counts to experts; 0 for the all zero assignment, and 1 marks one count for expert . We show the diagram of Aumix for the practical limit of two experts in Figure 5. In this case, the forward algorithm has running time O(n2 ). Each productive state in Figure 5 corresponds to a vector of two counts (n1 , n2 ) that sum to the sample size n, with the interpretation that of the n experts, the first was used n1 times while the second was used n2 times. These counts are a sufficient statistic for the multinomial model class: per (5b) and (9) the probability of the next expert only depends on the counts, and these probabilities are exactly the successor probabilities of the silent states (12). Other priors on are possible. In particular, when all mass is placed on a single value of , we retrieve the elementwise mixture with fixed coefficients. 4.2 Fixed Share The first publication that considers a scenario where the best predicting expert may change with the sample size is Herbster and Warmuth's paper on tracking the best expert [4, 5]. They partition the data of size n into m segments, where each segment is associated with an expert, and give algorithms to for some fixed constant c (see e.g. [13]) Combination with (11) and (10) completes the proof. Since the overhead incurred as a penalty for not knowing the optimal parameter (xn ) in advance is only logarithmic in ^ the sample size n, we find that Pumix is universal in a strong sense for the fixed elementwise mixtures. Figure 6 Fixed share A A A A value of has to be known in advance in order to minimise the loss. 
In Sections 4.3 and 5 we describe a number of generalisations of fixed share that avoid this problem: the switching rate need not be known in advance.

4.3 Universal Share

Volf and Willems describe universal share (they call it the switching method) [11], which is very similar to a probabilistic version of Herbster and Warmuth's fixed share algorithm, except that they put a prior on the unknown parameter, so that their algorithm adaptively learns the optimal value during prediction. In formula:

  P_us(xⁿ) = ∫ P_fs,α(xⁿ) w(α) dα.

Herbster and Warmuth give two algorithms, called fixed share and dynamic share. The second algorithm does not fit in our framework; furthermore its motivation applies only to loss functions other than log-loss. We focus on fixed share, which is in fact identical to our algorithm applied to the HMM depicted in Figure 6, where all arcs into the silent states have fixed probability α ∈ [0, 1] and all arcs from the silent states have some fixed distribution w on Ξ.² The same algorithm is also described as an instance of the Aggregating Algorithm in [12]. Fixed share reduces to fixed elementwise mixtures by setting α = 1 and to Bayesian mixtures by setting α = 0. Formally, using Q = Q_s ∪ Q_p:

  Q_s = {p} × N    Q_p = Ξ × Z⁺    Λ(⟨ξ, n⟩) = ξ    P°(⟨p, 0⟩) = 1
  P(⟨p, n⟩ → ⟨ξ, n+1⟩) = w(ξ)
  P(⟨ξ, n⟩ → ⟨p, n⟩) = α
  P(⟨ξ, n⟩ → ⟨ξ, n+1⟩) = 1 − α                                   (13)

Each productive state represents that a particular expert is used at a certain sample size. Once a transition to a silent state is made, all history is forgotten and a new expert is chosen according to w.³

In [1], Bousquet shows that the overhead for not knowing the optimal parameter value is equal to the overhead of estimating a Bernoulli parameter: let L_fs,α be as before, and let L_us = −log P_us(xⁿ) denote the loss of universal share with Jeffreys' prior w(α) = α^(−1/2) (1 − α)^(−1/2) / π. Then

  L_us − min_α L_fs,α ≤ 1 + ½ log n.                             (15)

Let L̂ denote the loss achieved by the best partition, with switching rate α̂ := m/(n − 1).
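Before turning to the loss bounds, note that the forward algorithm for the fixed share HMM (13) collapses to a simple O(n|Ξ|) weight update. A sketch (our code, not from the paper; natural logarithms):

```python
import math

def fixed_share_loss(predictions, alpha, w):
    """Log-loss -log P_fs(x^n) via the forward algorithm for HMM (13).
    predictions[t][xi] = probability expert xi assigned to outcome x_{t+1}.
    Reflexive switching: with probability alpha, forget the past and
    redraw the expert from w; otherwise keep the current expert."""
    v = list(w)                                   # weights on productive states
    loss = 0.0
    for p in predictions:
        joint = [vi * pi for vi, pi in zip(v, p)]
        mass = sum(joint)                         # P(x_{t+1} | x^t)
        loss -= math.log(mass)
        post = [j / mass for j in joint]          # posterior on experts
        v = [(1 - alpha) * q + alpha * wi for q, wi in zip(post, w)]
    return loss
```

With alpha = 0 this recovers the Bayesian mixture, with alpha = 1 the fixed elementwise mixture, matching the reductions stated above.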
Let L_fs,α denote the loss of fixed share with uniform w and parameter α. Herbster and Warmuth prove⁴

  L_fs,α − L̂ ≤ (n − 1) H(α̂, α) + (m − 1) log(|Ξ| − 1) + log|Ξ|,

which we for brevity loosen slightly to

  L_fs,α − L̂ ≤ n H(α̂, α) + m log|Ξ|.                            (14)

Here H(α̂, α) = −α̂ log α − (1 − α̂) log(1 − α) is the cross entropy.

Thus, by (15), P_us is universal for the model class {P_fs,α | α ∈ [0, 1]} that consists of all ES-joints where the ES-priors are distributions with a fixed switching rate. Universal share requires quadratic running time O(n²|Ξ|), restricting its use to moderately small data sets. In [8], Monteleoni and Jaakkola place a discrete prior on the parameter α that divides its mass over √n well-chosen points, in a setting where the ultimate sample size n is known beforehand. This way they still manage to achieve (15) up to a constant, while reducing the running time to O(n√n |Ξ|).

The HMM for universal share with the (1/2, 1/2)-Dirichlet prior on the switching rate is displayed in Figure 7. It is formally specified (using Q = Q_s ∪ Q_p) by:

  Q_s = {p, q} × {⟨m, n⟩ ∈ N² | m ≤ n}
  Q_p = Ξ × {⟨m, n⟩ ∈ N² | m < n}
  Λ(⟨ξ, m, n⟩) = ξ    P°(⟨p, 0, 0⟩) = 1
  P(⟨p, m, n⟩ → ⟨ξ, m, n+1⟩) = w(ξ)
  P(⟨q, m, n⟩ → ⟨p, m+1, n⟩) = 1
  P(⟨ξ, m, n⟩ → ⟨q, m, n⟩) = (m + 1/2) / n
  P(⟨ξ, m, n⟩ → ⟨ξ, m, n+1⟩) = (n − m − 1/2) / n               (16)

Each productive state ⟨ξ, m, n⟩ represents the fact that at sample size n expert ξ is used, while there have been m switches in the past. Note that the last two lines of (16) are subtly different from the corresponding topmost line of (12). In a sample of size n there are n possible positions to use a given expert, while there are only n − 1 possible switch positions. The presence of the switch count in the state is the new ingredient compared to fixed share. It allows us to adapt the switching probability to the data, but it also renders the number of states quadratic. We discuss reducing the number of states without sacrificing much performance in [6].
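Read off from (16), the forward algorithm for universal share tracks weights over (expert, switch count) pairs. A sketch of our reading (quadratic in n, as the text states):

```python
import math

def universal_share_loss(predictions, w):
    """Log-loss via the forward algorithm for the universal share HMM (16).
    A state (xi, m) means expert xi is in use after m switches; under the
    Jeffreys prior the next-step switch probability after n outcomes with
    m switches is (m + 1/2) / n. Running time O(n^2 K) for K experts."""
    K = len(w)
    states = {(xi, 0): w[xi] for xi in range(K)}
    loss = 0.0
    for n, p in enumerate(predictions, start=1):
        states = {(xi, m): wt * p[xi] for (xi, m), wt in states.items()}
        mass = sum(states.values())
        loss -= math.log(mass)
        new = {}
        for (xi, m), wt in states.items():
            wt /= mass
            switch = (m + 0.5) / n                 # last two lines of (16)
            new[(xi, m)] = new.get((xi, m), 0.0) + wt * (1 - switch)
            for xi2 in range(K):                   # redraw expert from w
                key = (xi2, m + 1)
                new[key] = new.get(key, 0.0) + wt * switch * w[xi2]
        states = new
    return loss
```

The dictionary of (expert, m) states grows linearly with the sample size, which is exactly the quadratic-state phenomenon discussed above.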
The best loss guarantee is obtained for α = α̂, in which case the cross entropy reduces to the binary entropy H(α̂).

² This is actually a slight generalisation: the original algorithm uses a uniform w(ξ) = 1/|Ξ|.
³ Contrary to the original fixed share, we allow switching to the same expert. In the HMM framework this is necessary to achieve running time O(n|Ξ|). Under uniform w, non-reflexive switching with fixed rate α can be simulated by reflexive switching with fixed rate α′ = α|Ξ|/(|Ξ| − 1) (provided α′ ≤ 1). For non-uniform w, the rate becomes expert-dependent.
⁴ This bound can be obtained for the fixed share HMM using the previous footnote.

5 New Models to Switch between Experts

So far we have considered two models for switching between experts: fixed share and its generalisation, universal share.

Figure 7: Universal share [HMM diagram omitted]

First, we let the probability of switching to a different expert decrease with the sample size. This allows us to derive a loss bound close to that of the fixed share algorithm, without the need to tune any parameters.⁵ Second, the switch distribution has a special provision to ensure that in the case where the number of switches remains bounded, the incurred loss overhead is O(1). The switch distribution was introduced in [10], which addresses a long-standing open problem in statistical model class selection known as the "AIC vs BIC dilemma". Here we disregard such applications and treat the switch distribution like the other models for combining expert predictions. In §5.1.1 we describe an HMM that corresponds to the switch distribution; this illuminates the relationship between the switch distribution and the fixed share algorithm, which it in fact generalises. We provide a loss bound for the switch distribution in §5.1.2.
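The rate conversion in Footnote 3 is easy to verify numerically: under uniform w over K experts, a reflexive switch lands on a different expert with probability (K − 1)/K. A toy check (the numbers are illustrative):

```python
# Footnote 3: reflexive switching at rate alpha' = alpha * K / (K - 1)
# moves to a *different* expert with probability alpha (uniform w).
K = 4
alpha = 0.15
alpha_reflexive = alpha * K / (K - 1)
p_move_to_other = alpha_reflexive * (K - 1) / K   # switch fires, lands elsewhere
assert abs(p_move_to_other - alpha) < 1e-12
```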
5.1.1 Switch HMM

Let σ and τ be sequences of distributions on {0, 1}, which we call the switch and stabilisation probabilities. The switch HMM A_sw, displayed in Figure 8, is defined below using Q = Q_s ∪ Q_p:

  Q_s = {p, p_s, p_u} × N    Q_p = {s, u} × Ξ × Z⁺
  P°(⟨p, 0⟩) = 1    Λ(⟨s, ξ, n⟩) = Λ(⟨u, ξ, n⟩) = ξ
  P(⟨p, n⟩ → ⟨p_u, n⟩) = τ_n(0)
  P(⟨p, n⟩ → ⟨p_s, n⟩) = τ_n(1)
  P(⟨p_u, n⟩ → ⟨u, ξ, n+1⟩) = w(ξ)
  P(⟨p_s, n⟩ → ⟨s, ξ, n+1⟩) = w(ξ)
  P(⟨s, ξ, n⟩ → ⟨s, ξ, n+1⟩) = 1
  P(⟨u, ξ, n⟩ → ⟨u, ξ, n+1⟩) = σ_n(0)
  P(⟨u, ξ, n⟩ → ⟨p, n⟩) = σ_n(1)

While fixed share is an extremely efficient algorithm, it requires that the frequency of switching between experts is estimated a priori, which can be hard in practice. Moreover, we may have prior knowledge about how the switching probability will change over time, but unless we know the ultimate sample size in advance, we may be forced to accept a linear overhead compared to the best parameter value. Universal share overcomes this problem by marginalising over the unknown parameter, but has quadratic running time. The first model considered in this section, the switch distribution, avoids both problems. It is parameterless and has essentially the same running time as fixed share. It also achieves a loss bound competitive to that of universal share. Moreover, for a bounded number of switches the bound has even better asymptotics.

The second model is called the run-length model because it uses a run-length code (cf. [7]) as an ES-prior. This may be useful because, while both fixed and universal share model the distance between switches with a geometric distribution, the real distribution on these distances may be different. This is the case if, for example, the switches are highly clustered. This additional expressive power comes at the cost of quadratic running time, but we discuss a special case where this may be reduced to linear. We conclude this section with a comparison of the two expert switching models.
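The two-band structure of the switch HMM admits an O(n|Ξ|) forward recursion: one weight per expert per band. A sketch of our reading, with illustrative choices (ours) of stabilisation probability τ_n(1) = 1/2 and switch probability σ_n(1) = 1/(n + 1):

```python
import math

def switch_loss(predictions, w):
    """Log-loss via the forward algorithm for the switch HMM.
    u = unstable-band weights (may still switch), s = stable-band
    weights (expert fixed forever). Illustrative choices, not the
    paper's: tau = 1/2 throughout, switch probability 1/(n + 1)."""
    u = [0.5 * wi for wi in w]        # initial tau-split between the bands
    s = [0.5 * wi for wi in w]
    loss = 0.0
    for n, p in enumerate(predictions, start=1):
        u = [ui * pi for ui, pi in zip(u, p)]
        s = [si * pi for si, pi in zip(s, p)]
        mass = sum(u) + sum(s)
        loss -= math.log(mass)
        u = [ui / mass for ui in u]
        s = [si / mass for si in s]
        sigma = 1.0 / (n + 1)         # decreasing switch probability
        moved = sigma * sum(u)        # unstable mass entering <p, n>
        u = [(1 - sigma) * ui + 0.5 * moved * wi for ui, wi in zip(u, w)]
        s = [si + 0.5 * moved * wi for si, wi in zip(s, w)]
    return loss
```

Only 2|Ξ| weights are maintained per step, which is why the switch distribution has essentially the running time of fixed share.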
5.1 Switch Distribution

The switch distribution is a recent model for combining expert predictions. Like fixed share, it is intended for settings where the best predicting expert is expected to change as a function of the sample size, but it has two major innovations.

This HMM contains two "expert bands". Consider a productive state ⟨u, ξ, n⟩ in the bottom band, which we call the unstable band, from a generative viewpoint. Two things can happen. With probability σ_n(0) the process continues horizontally to ⟨u, ξ, n+1⟩ and the story repeats. We say that no switch occurs. With probability σ_n(1) the process continues to the silent state ⟨p, n⟩ directly to the right. We say that a switch occurs. Then a new choice has to be made. With probability τ_n(0) the process continues rightward to ⟨p_u, n⟩ and then branches out to some productive state ⟨u, ξ′, n+1⟩ (possibly ξ′ = ξ), and the story repeats. With probability τ_n(1) the process continues to ⟨p_s, n⟩ in the top band, called the stable band. Also here it branches out to some productive state ⟨s, ξ′, n+1⟩. But from this point onward there are no choices anymore; expert ξ′ is produced forever. We say that the process has stabilised. By choosing τ_n(1) = 0 and σ_n(1) = α for all n we essentially remove the stable band and arrive at fixed share with parameter α. The presence of the stable band enables us to improve the loss bound of fixed share in the particular

⁵ The idea of decreasing the switch probability as 1/(n + 1), which has not previously been published, was independently conceived by Mark Herbster and the authors.

Figure 8: The switch distribution [HMM diagram omitted]

The prior T may be written T(n) = 1/(n − 1) − 1/n, so that

  T(Z > t_{i−1}) = Σ_{n > t_{i−1}} (1/(n − 1) − 1/n) = 1/t_{i−1},  hence  T(t_i | t_i > t_{i−1}) = t_{i−1} / (t_i (t_i − 1)).

If we substitute this in the last term of (17), the sum telescopes and we are left with

  −log t₁ + log t_m + Σ_{i=2}^m log(t_i − 1) = log t_m + Σ_{i=2}^m log(t_i − 1),  since t₁ = 1
  (18)

If we fix t_m, expression (18) is maximised if t₂, ..., t_{m−1} take on the values t_m − m + 2, ..., t_m − 1, so that (18) becomes

  log t_m + Σ_{i=t_m−m+1}^{t_m−1} log i = log( t_m! / (t_m − m)! ) = log C(t_m, m) + log(m!).

The theorem follows using this upper bound. □

case that the number of switches is bounded; in that case, the stable band allows us to remove the dependency of the loss bound on n altogether. We will use the particular choice τ_n(0) = 1/2 for all n, and σ_n(1) = T(Z = n | Z ≥ n) for an arbitrary distribution T on Z⁺. This allows us to relate the switch HMM to the parametric representation that we present next.

5.1.2 A Loss Bound

We derive a loss bound of the same type as the bound for the fixed share algorithm (see §4.2). We need the following lemma, which is proven in [6].

Lemma 5. Fix an expert sequence ξⁿ. Let m denote the number of blocks in ξⁿ, where the blocks are the maximal subsequences containing only a single expert. Let 1 = t₁ < t₂ < ··· < t_m ≤ n be the indices where the blocks start. Then

  π_sw(ξⁿ) ≥ 2^(−m) w(ξ_{t₁}) Π_{i=2}^m w(ξ_{t_i}) T(Z = t_i | Z > t_{i−1}).

Note that this loss bound is a function of the index of the last switch t_m rather than of the sample size n; this means that in the important scenario where the number of switches remains bounded in n, the loss compared to the best partition is O(1). The bound compares quite favourably with the loss bound for the fixed share algorithm (see §4.2). We can tighten our bound slightly by using the fact that we allow switches to the same expert, as also remarked in Footnote 3 on page 8. For brevity we do not pursue this here, but the difference is exactly that between (14) and the original bound for the fixed share algorithm. We now investigate how much worse the above guarantees are compared to (14). The overhead of fixed share is bounded from above by nH(α) + m log|Ξ|.
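The comparison that follows uses an elementary lower bound on the entropy term, spelled out here (our step, not displayed in the original):

```latex
n H(m/n) \;=\; m \log\frac{n}{m} \;+\; (n-m)\log\frac{n}{n-m}
         \;\ge\; m \log\frac{n}{m},
\qquad\text{since } (n-m)\log\tfrac{n}{n-m} \ge 0 .
```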
We first underestimate this worst-case loss by substituting the optimal value α = m/n, and rewrite

  nH(α) ≥ nH(m/n) ≥ m log(n/m).

Second we overestimate the loss of the switch distribution by substituting the worst case t_m = n. We then find the maximal difference between the two bounds to be

  m + m log|Ξ| + log C(n, m) + log(m!) − (m log(n/m) + m log|Ξ|) ≤ m + m log m.    (19)

Thus using the switch distribution instead of fixed share lowers the guarantee by at most m + m log m bits, which is significant only if the number of switches is relatively large. On the flip side, using the switch distribution does not require any prior knowledge about the data (i.e. the maximum likelihood switching rate). This is a big advantage in a setting where we desire to maintain the bound sequentially. This is impossible with the fixed share algorithm in case the optimal value of α varies with n.

Theorem 6. Fix data xⁿ. Let ξⁿ maximise the likelihood P_ξⁿ(xⁿ) among all expert sequences with m blocks. Let t_m be the index of the first element of the last block in ξⁿ. Let T(n) = 1/(n(n − 1)) and w be uniform. Then the loss overhead −log P_sw(xⁿ) + log P_ξⁿ(xⁿ) of the switch distribution is bounded by

  m + m log|Ξ| + log C(t_m, m) + log(m!).

Proof. We have

  −log P_sw(xⁿ) + log P_ξⁿ(xⁿ) ≤ −log π_sw(ξⁿ)
    ≤ −log [ 2^(−m) w(ξ₁) Π_{i=2}^m T(t_i | t_i > t_{i−1}) w(ξ_{t_i}) ]
    = m + m log|Ξ| − Σ_{i=2}^m log T(t_i | t_i > t_{i−1}).    (17)

5.2 Run-length Model

Run-length codes have been used extensively in the context of data compression, see e.g. [7]. Rather than applying run-length codes directly to the observations, we reinterpret the corresponding probability distributions as ES-priors, because they may constitute good models for the distances between consecutive switches. The run-length model is especially useful if the switches are clustered, in the sense that some blocks in the expert sequence contain relatively few switches, while other blocks contain many.
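A geometric distribution on inter-switch distances has a constant hazard rate, which is exactly why it cannot favour clustered switches. A quick numeric check (the function names are ours):

```python
def hazard(T, d, horizon=1000):
    """T(Z = d | Z >= d): probability a run ends at length d given that
    it reached length d (tail computed by truncation, for illustration)."""
    tail = sum(T(j) for j in range(d, horizon))
    return T(d) / tail

geometric = lambda d: 0.7 * 0.3 ** (d - 1)   # P(run length = d), d = 1, 2, ...
# memorylessness: the hazard is 0.7 at every run length
```

A clustered-switch prior would instead need a hazard that varies with the run length, which is what a general T in the run-length model provides.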
The fixed share algorithm remains oblivious to such properties, as its predictions of the expert sequence are based on a Bernoulli model: the probability of switching remains the same, regardless of the index of the previous switch. Essentially the same limitation also applies to the universal share algorithm, whose switching probability normally converges as the sample size increases. The switch distribution is efficient when the switches are clustered toward the beginning of the sample: its switching probability decreases in the sample size. However, this may be unrealistic and may introduce a new unnecessary loss overhead.

The run-length model is based on the assumption that the intervals between successive switches are independently distributed according to some distribution T. After the universal share model and the switch distribution, this is a third generalisation of the fixed share algorithm, which is recovered by taking a geometric distribution for T. As may be deduced from the defining HMM, which is given below, we require quadratic running time O(n²|Ξ|) to evaluate the run-length model in general.

5.2.1 Run-length HMM

Let S := {⟨m, n⟩ ∈ N² | m < n} and let T be a distribution on Z⁺. The specification of the run-length HMM is given using Q = Q_s ∪ Q_p by:

  Q_s = ({q} × S) ∪ ({p} × N)    Q_p = Ξ × S
  Λ(⟨ξ, m, n⟩) = ξ    P°(⟨p, 0⟩) = 1
  P(⟨p, n⟩ → ⟨ξ, n, n+1⟩) = w(ξ)
  P(⟨ξ, m, n⟩ → ⟨ξ, m, n+1⟩) = T(Z > n − m | Z ≥ n − m)
  P(⟨ξ, m, n⟩ → ⟨q, m, n⟩) = T(Z = n − m | Z ≥ n − m)
  P(⟨q, m, n⟩ → ⟨p, n⟩) = 1

Figure 9: The run-length model [HMM diagram omitted]

5.2.2 A Loss Bound

Fix an expert sequence ξⁿ with m blocks. For i = 1, ..., m, let δ_i and k_i denote the length and expert of block i.
From the definition of the HMM above, we obtain that −log π_rl(ξⁿ) equals

  Σ_{i=1}^m −log w(k_i) + Σ_{i=1}^{m−1} −log T(Z = δ_i) − log T(Z ≥ δ_m).    (20)

Theorem 7. Fix data xⁿ. Let ξⁿ maximise the likelihood P_ξⁿ(xⁿ) among all expert sequences with m blocks. Let w be the uniform distribution on experts, and let T be log-convex. Then the loss overhead is bounded thus:

  −log P_rl(xⁿ) + log P_ξⁿ(xⁿ) ≤ m log|Ξ| − m log T(n/m).

Proof. We overestimate

  −log P_rl(xⁿ) + log P_ξⁿ(xⁿ) ≤ −log π_rl(ξⁿ) ≤ m log|Ξ| + Σ_{i=1}^m −log T(δ_i),

using (20), uniform w, and T(Z ≥ δ_m) ≥ T(Z = δ_m). Since −log T is concave, by Jensen's inequality we have

  (1/m) Σ_{i=1}^m −log T(δ_i) ≤ −log T( (1/m) Σ_{i=1}^m δ_i ) = −log T(n/m).

In other words, the block lengths δ_i are all equal in the worst case. Plugging this into the above we obtain the theorem. □

5.2.3 Finite Support

We have seen that the run-length model reduces to fixed share if the prior on switch distances T is geometric, so that it can be evaluated in linear time in that case. We also obtain a linear-time algorithm when T has finite support, because then only a constant number of states can receive positive weight at any sample size. For this reason it can be advantageous to choose a T with finite support, even if one expects that arbitrarily long distances between consecutive switches may occur. Expert sequences with such longer distances between switches can still be represented with a truncated T using a sequence of switches from and to the same expert. This way, long runs of the same expert receive exponentially small, but positive, probability.

5.3 Comparison

We have discussed two models for switching: the recent switch distribution and the new run-length model. It is natural to wonder which model to apply. One possibility is to compare asymptotic loss bounds. To compare the bounds given by Theorems 6 and 7, we substitute t_m + 1 = n in the bound for the switch distribution, and use a prior T for the run-length model that satisfies −log T(n) ≤ log n + 2 log log(n + 1) + 3 (for instance an Elias code [3]). The next step is to determine which bound is better depending on how fast m grows as a function of n.
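To get a feel for the crossover, the two bounds, as we read them (in nats, with the Elias-style bound −log T(d) ≤ log d + 2 log log(d + 1) + 3 for the run-length prior and worst case t_m = n for the switch distribution), can be compared numerically. The helper names are ours:

```python
import math

def log_binom(n, m):
    """log of the binomial coefficient C(n, m), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

def switch_bound(m, n, K):
    """Our reading of Theorem 6 with t_m = n:
    m + m log K + log C(n, m) + log m!"""
    return m + m * math.log(K) + log_binom(n, m) + math.lgamma(m + 1)

def runlength_bound(m, n, K):
    """Our reading of Theorem 7 with -log T(d) <= log d + 2 log log(d+1) + 3."""
    d = n / m
    return m * math.log(K) + m * (math.log(d) + 2 * math.log(math.log(d + 1)) + 3)

n, K = 10**6, 2
# few switches: the switch distribution bound is lower;
# many switches (m well above (log n)^2): the run-length bound is lower
```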
It only makes sense to consider m non-decreasing in n.

Theorem 8. The loss bound of the switch distribution (with t_m = n) is asymptotically lower than that of the run-length model (with T as above) if m = o((log n)²), and asymptotically higher if m = Ω((log n)²).⁶

Proof sketch. After eliminating terms common to both loss bounds, it remains to compare

  m + m log m    to    2m log log(n/m + 1) + 3m.

The standard algorithms for HMMs (Forward, Backward, Viterbi and Baum–Welch) can be used to answer questions about the ES-prior as well as the induced distribution on data. The running time of the forward algorithm can be read off directly from the graphical representation of the HMM. Our approach allows unification of many existing expert models, including mixture models and fixed share. We gave their defining HMMs and recovered the best known running times. We also introduced two new parameterless generalisations of fixed share. The first, called the switch distribution, was recently introduced to improve model selection performance. We rendered it as a small HMM, which shows how it can be evaluated in linear time. The second, called the run-length model, uses a run-length code in a novel way, namely as an ES-prior. This model has quadratic running time. We compared the loss bounds of the two models asymptotically, and showed that the run-length model is preferred if the number of switches grows like (log n)² or faster, while the switch distribution is preferred if it grows slower. We provided graphical representations and loss bounds for all considered models.

Acknowledgements

Peter Grünwald's and Tim van Erven's suggestions significantly improved the quality of this paper. Thanks also go to Mark Herbster for a fruitful and enjoyable afternoon exchanging ideas, which has certainly influenced the shape of this paper.

If m is bounded, the left hand side is clearly lower for sufficiently large n.
Otherwise we may divide by m, exponentiate, simplify, and compare m to (log n − log m)², from which the theorem follows directly. □

References

[1] O. Bousquet. A note on parameter tuning for on-line shifting algorithms. Technical report, Max Planck Institute for Biological Cybernetics, 2003.
[2] A. P. Dawid. Statistical theory: The prequential approach. Journal of the Royal Statistical Society, Series A, 147, Part 2:278–292, 1984.
[3] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
[4] M. Herbster and M. K. Warmuth. Tracking the best expert. In Proceedings of the 12th Annual Conference on Learning Theory (COLT 1995), pages 286–294, 1995.
[5] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.
[6] W. M. Koolen and S. de Rooij. Combining expert advice efficiently. arXiv:0802.2015, Feb 2008.
[7] A. Moffat. Compression and Coding Algorithms. Kluwer Academic Publishers, 2002.
[8] C. Monteleoni and T. Jaakkola. Online learning of non-stationary sequences. Advances in Neural Information Processing Systems, 16, 2003.
[9] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–285, 1989.
[10] T. van Erven, P. D. Grünwald, and S. de Rooij. Catching up faster in Bayesian model selection and model averaging. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.
[11] P. Volf and F. Willems. Switching between two universal source coding algorithms. In Proceedings of the Data Compression Conference, Snowbird, Utah, pages 491–500, 1998.
[12] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35:247–282, 1999.
[13] Q. Xie and A. Barron. Asymptotic minimax regret for data compression, gambling and prediction. IEEE Transactions on Information Theory, 46(2):431–445, 2000.
For finite samples, the switch distribution can be used in case the switches are expected to occur early on average, or if the running time is paramount. Otherwise the run-length model is preferable.

6 Conclusion

In prediction with expert advice, the goal is to formulate prediction strategies that perform as well as the best possible expert (combination). Expert predictions can be combined by taking a weighted mixture at every sample size. The best weights generally evolve over time. In this paper we introduced expert sequence priors (ES-priors), which are probability distributions over infinite sequences of experts, to model the trajectory followed by the optimal mixture weights. Prediction with expert advice then amounts to marginalising the joint distribution constructed from the chosen ES-prior and the experts' predictions.

We employed hidden Markov models (HMMs) to specify ES-priors. HMMs' explicit notion of current state and state-to-state evolution naturally fits the temporal correlations we seek to model. For reasons of efficiency we use HMMs with silent states.

⁶ Let f, g : N → N. We say f = o(g) if lim_{n→∞} f(n)/g(n) = 0. We say f = Ω(g) if ∃c > 0 ∃n₀ ∀n ≥ n₀ : f(n) ≥ c·g(n).