Catching Up Faster in Bayesian Model Selection and Model Averaging

Tim van Erven    Peter Grünwald    Steven de Rooij
Centrum voor Wiskunde en Informatica (CWI)
Kruislaan 413, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
{Tim.van.Erven,Peter.Grunwald,Steven.de.Rooij}@cwi.nl

Abstract

Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian marginal distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient implementation.

1 Introduction

We consider inference based on a countable set of models (sets of probability distributions), focusing on two tasks: model selection and model averaging. In model selection, the goal is to select the model that best explains the given data. In model averaging, the goal is to find the weighted combination of models that leads to the best prediction of future data from the same source.

An attractive property of some criteria for model selection is that they are consistent under weak conditions: if the true distribution $P^*$ is in one of the models, then the $P^*$-probability that this model is selected goes to one as the sample size increases. BIC [14], Bayes factor model selection [8], Minimum Description Length (MDL) model selection [3] and prequential model validation [5] are examples of widely used model selection criteria that are usually consistent. However, other model selection criteria such as AIC [1] and leave-one-out cross-validation (LOO) [16], while often inconsistent, typically yield better predictions. This is especially the case in nonparametric settings, where $P^*$ can be arbitrarily well approximated by a sequence of distributions in the (parametric) models under consideration, but is not itself contained in any of these. In many such cases, the predictive distribution converges to the true distribution at the optimal rate for AIC and LOO [15, 9], whereas in general BIC, the Bayes factor method and prequential validation only achieve the optimal rate to within an $O(\log n)$ factor [13, 20, 6]. In this paper we reconcile these seemingly conflicting approaches [19] by improving the rate of convergence achieved in Bayesian model selection without losing its consistency properties. First we provide an example to show why Bayes sometimes converges too slowly.

Given priors on models and parameters therein, Bayesian inference is based on the posterior distribution that is obtained by conditioning on observed outcomes. In model selection the preferred model is the one with maximum a posteriori probability. In prediction the marginal distributions $\bar p_1, \bar p_2, \ldots$ (defined as in (1) below) are weighted according to the posterior, a process called Bayesian Model Averaging (BMA). We denote the resulting distribution $p_{\mathrm{bma}}$.
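Concretely, the BMA posterior puts mass on model $k$ in proportion to $w(k)\,\bar p_k(x^n)$, and $p_{\mathrm{bma}}(x_{n+1} \mid x^n)$ is the posterior-weighted mixture of the models' predictions. The following minimal sketch (ours, for illustration only; the array names are placeholders, not the paper's notation) carries out this computation in log space:

```python
# Sketch of Bayesian model averaging over a finite list of models.
# marginal_log_probs[k] is assumed to hold log p_k(x^n) for model k;
# log_prior[k] is log w(k); next_log_probs[k] is log p_k(x_{n+1} | x^n).
import numpy as np

def bma_log_posterior(marginal_log_probs, log_prior):
    """Log posterior over model indices: w(k | x^n) is proportional to
    p_k(x^n) * w(k); normalization is done in log space for stability."""
    unnorm = np.asarray(marginal_log_probs) + np.asarray(log_prior)
    return unnorm - np.logaddexp.reduce(unnorm)

def bma_log_predict(next_log_probs, marginal_log_probs, log_prior):
    """Log of the BMA predictive probability of the next outcome:
    p_bma(x_{n+1} | x^n) = sum_k w(k | x^n) p_k(x_{n+1} | x^n)."""
    log_post = bma_log_posterior(marginal_log_probs, log_prior)
    return np.logaddexp.reduce(log_post + np.asarray(next_log_probs))
```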
In a sequential setting, the probability of a data sequence $x^n := x_1, \ldots, x_n$ under a distribution $p$ typically decreases exponentially fast in $n$. It is therefore common to consider $-\log p(x^n)$, which we call the codelength of $x^n$ achieved by $p$. This name refers to the correspondence between codelength functions and probability distributions based on the Kraft inequality, but one may also think of the codelength as the accumulated log loss that is incurred if we sequentially predict the $x_i$ by conditioning on the past, i.e. using $p(\cdot \mid x^{i-1})$ [3, 6, 5, 11]. From here on all logarithms are taken to base 2, allowing us to measure codelength in bits.

Prediction using $p_{\mathrm{bma}}$ has the advantage that the codelength it achieves on $x^n$ is close to the codelength of $\bar p_{\hat k}$, where $\hat k$ is the index of the best of the marginals $\bar p_1, \bar p_2, \ldots$ Namely, given a prior $w$ on model indices, the difference between $-\log p_{\mathrm{bma}}(x^n) = -\log \sum_k \bar p_k(x^n)\, w(k)$ and $-\log \bar p_{\hat k}(x^n)$ must be in the range $[0, -\log w(\hat k)]$, whatever data $x^n$ are observed. Thus, using BMA for prediction is sensible if we are satisfied with doing essentially as well as the best model under consideration. However, it is often possible to combine $\bar p_1, \bar p_2, \ldots$ into a distribution that achieves smaller codelength than $\bar p_{\hat k}$! This is possible if the index $\hat k$ of the best distribution changes with the sample size in a predictable way. This is common in model selection, for example with nested models, say $\mathcal{M}_1 \subset \mathcal{M}_2$. In this case $\bar p_1$ typically predicts better at small sample sizes (roughly, because $\mathcal{M}_2$ has more parameters that need to be learned than $\mathcal{M}_1$), while $\bar p_2$ predicts better eventually. Figure 1 illustrates this phenomenon. It shows the accumulated codelength difference $-\log \bar p_2(x^n) - (-\log \bar p_1(x^n))$ on "The Picture of Dorian Gray" by Oscar Wilde, where $\bar p_1$ and $\bar p_2$ are the Bayesian marginal distributions for the first-order and second-order Markov chains, respectively, and each character in the book is an outcome. Note that the example models $\mathcal{M}_1$ and $\mathcal{M}_2$ are very crude; for this particular application much better models are available. In more complicated, more realistic model selection scenarios, the models may still be wrong, but it may not be known how to improve them. Thus, $\mathcal{M}_1$ and $\mathcal{M}_2$ serve as a simple illustration only. We used uniform priors on the model parameters, but for other common priors similar behaviour can be expected. Clearly $\bar p_1$ is better for about the first 100 000 outcomes, gaining a head start of approximately 40 000 bits. Ideally we should predict the initial 100 000 outcomes using $\bar p_1$ and the rest using $\bar p_2$. However, $p_{\mathrm{bma}}$ only starts to behave like $\bar p_2$ when it catches up with $\bar p_1$ at a sample size of about 310 000, when the codelength of $\bar p_2$ drops below that of $\bar p_1$. Thus, in the shaded area $p_{\mathrm{bma}}$ behaves like $\bar p_1$ while $\bar p_2$ is making better predictions of those outcomes: since at $n = 100\,000$, $\bar p_2$ is 40 000 bits behind, and at $n = 310\,000$, it has caught up, in between it must have outperformed $\bar p_1$ by 40 000 bits!

The general pattern that first one model is better and then another occurs widely, both on real-world data and in theoretical settings. We argue that failure to take this effect into account leads to the suboptimal rate of convergence achieved by Bayes factor model selection and related methods. We have developed an alternative method to combine distributions $\bar p_1$ and $\bar p_2$ into a single distribution $p_{\mathrm{sw}}$, which we call the switch-distribution, defined in Section 2.

[Figure 1: The Catch-Up Phenomenon. Accumulated codelength difference with the first-order Markov chain (bits) versus sample size, shown for the second-order Markov chain, Bayesian Model Averaging, and the Switch-Distribution.]
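The codelength curves behind Figure 1 are easy to recompute in essence. With uniform (Dirichlet) priors on the transition probabilities of each context, the Bayesian predictive of a Markov chain is given by the Laplace rule (cf. Example 1 below), so the marginal codelength accumulates sequentially. The sketch below is ours, not the authors' code; the file path is a placeholder, and boundary contexts at the start of the sequence are handled by truncation, which may differ in detail from the paper's experiment:

```python
# Codelength of a byte sequence under the Bayesian marginal of an
# order-`order` Markov chain with uniform Dirichlet priors per context.
from collections import defaultdict
from math import log2

def markov_codelength(data, order, alphabet_size=256):
    """Returns -log2 of the Bayesian marginal, computed via the chain rule:
    each symbol is predicted by the Laplace rule (count+1)/(total+alphabet)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    codelength = 0.0
    for i, symbol in enumerate(data):
        ctx = tuple(data[max(0, i - order):i])  # truncated near the start
        p = (counts[ctx][symbol] + 1) / (totals[ctx] + alphabet_size)
        codelength -= log2(p)
        counts[ctx][symbol] += 1
        totals[ctx] += 1
    return codelength

# data = open("dorian_gray.txt", "rb").read()   # placeholder path
# diff = markov_codelength(data, 2) - markov_codelength(data, 1)
```

Accumulating the per-symbol terms instead of the total reproduces the difference curve plotted in Figure 1.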
Figure 1 shows that $p_{\mathrm{sw}}$ behaves like $\bar p_1$ initially, but in contrast to $p_{\mathrm{bma}}$ it starts to mimic $\bar p_2$ almost immediately after $\bar p_2$ starts making better predictions; it essentially does this no matter what sequence $x^n$ is actually observed. $p_{\mathrm{sw}}$ differs from $p_{\mathrm{bma}}$ in that it is based on a prior distribution on sequences of models rather than simply a prior distribution on models. This allows us to avoid the implicit assumption that there is one model which is best at all sample sizes. After conditioning on past observations, the posterior we obtain gives a better indication of which model performs best at the current sample size, thereby achieving a faster rate of convergence. Indeed, the switch-distribution is related to earlier algorithms for tracking the best expert developed in the universal prediction literature [7, 18, 17, 10]; however, the applications we have in mind and the theorems we prove are completely different. In Sections 3 and 4 we show that model selection based on the switch-distribution is consistent (Theorem 1), but unlike standard Bayes factor model selection achieves optimal rates of convergence (Theorem 2). Proofs of the theorems are in Appendix A. In Section 5 we give a practical algorithm that computes the switch-distribution for $K$ (rather than 2) predictors in $\Theta(n \cdot K)$ time. In the full paper, we give further details of the proof of Theorem 1 and a more detailed discussion of Theorem 2 and the implications of both theorems.

2 The Switch-Distribution for Model Selection and Prediction

Preliminaries  Suppose $X^\infty = (X_1, X_2, \ldots)$ is a sequence of random variables that take values in sample space $\mathcal{X} \subseteq \mathbb{R}^d$ for some $d \in \mathbb{Z}^+ = \{1, 2, \ldots\}$. For $n \in \mathbb{N} = \{0, 1, 2, \ldots\}$, let $x^n = (x_1, \ldots, x_n)$ denote the first $n$ outcomes of $X^\infty$, such that $x^n$ takes values in the product space $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. (We let $x^0$ denote the empty sequence.) Let $\mathcal{X}^* = \bigcup_{n=0}^{\infty} \mathcal{X}^n$. For $m > n$, we write $X_{n+1}^m$ for $(X_{n+1}, \ldots, X_m)$, where $m = \infty$ is allowed, and we omit the subscript when $n = 0$.

Any distribution $P(X^\infty)$ may be defined in terms of a sequential prediction strategy $p$ that predicts the next outcome at any time $n \in \mathbb{N}$. To be precise: given the previous outcomes $x^n$ at time $n$, this prediction strategy should issue a conditional density $p(X_{n+1} \mid x^n)$ with corresponding distribution $P(X_{n+1} \mid x^n)$ for the next outcome $X_{n+1}$. Such sequential prediction strategies are sometimes called prequential forecasting systems [5]. An instance is given in Example 1 below. We assume that the density $p(X_{n+1} \mid x^n)$ is taken relative to either the usual Lebesgue measure (if $\mathcal{X}$ is continuous) or the counting measure (if $\mathcal{X}$ is countable). In the latter case $p(X_{n+1} \mid x^n)$ is a probability mass function. It is natural to define the joint density $p(x^m \mid x^n) = p(x_{n+1} \mid x^n) \cdots p(x_m \mid x^{m-1})$ and to let $P(X_{n+1}^\infty \mid x^n)$ be the unique distribution such that, for all $m > n$, $p(X_{n+1}^m \mid x^n)$ is the density of its marginal distribution for $X_{n+1}^m$. To ensure that $P(X_{n+1}^\infty \mid x^n)$ is well defined even if $\mathcal{X}$ is continuous, we impose the natural requirement that for any $k \in \mathbb{Z}^+$ and any fixed event $A_{k+1} \subseteq \mathcal{X}_{k+1}$ the probability $P(A_{k+1} \mid x^k)$ is a measurable function of $x^k$, which holds automatically if $\mathcal{X}$ is countable.
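In code, a prequential forecasting system in the discrete case is just a function from a past and a candidate next outcome to a probability, and the joint density follows by the chain rule. A minimal sketch (ours; type names are illustrative):

```python
# A prediction strategy maps past outcomes x^{n} and a candidate outcome x
# to the conditional probability p(X_{n+1} = x | x^n). Discrete case only.
from typing import Callable, Sequence
from math import log2

PredictionStrategy = Callable[[Sequence[int], int], float]

def joint_log2_density(p: PredictionStrategy, xs: Sequence[int]) -> float:
    """log2 p(x^n) = sum_i log2 p(x_i | x^{i-1}); its negation is the
    codelength of x^n under p, as used throughout the paper."""
    return sum(log2(p(xs[:i], xs[i])) for i in range(len(xs)))

uniform = lambda past, x: 0.5            # trivial strategy on X = {0, 1}
assert joint_log2_density(uniform, [0, 1, 1]) == -3.0
```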
Model Selection and Prediction  The goal in model selection is to choose an explanation for observed data $x^n$ from a potentially infinite list of candidate models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ We consider parametric models, which are sets $\{p_\theta : \theta \in \Theta\}$ of prediction strategies $p_\theta$ that are indexed by elements of $\Theta \subseteq \mathbb{R}^d$, for some smallest possible $d \in \mathbb{N}$, the number of degrees of freedom. Examples of model selection are regression based on a set of basis functions such as polynomials ($d$ is the number of coefficients of the polynomial), the variable selection problem in regression [15, 9, 20] ($d$ is the number of variables), and histogram density estimation [13] ($d$ is the number of bins). A model selection criterion is a function $\delta: \mathcal{X}^* \to \mathbb{Z}^+$ that, given any data sequence $x^n \in \mathcal{X}^*$, selects the model $\mathcal{M}_k$ with index $k = \delta(x^n)$.

We associate each model $\mathcal{M}_k$ with a single prediction strategy $\bar p_k$. The bar emphasizes that $\bar p_k$ is a meta-strategy based on the prediction strategies in $\mathcal{M}_k$. In many approaches to model selection, for example AIC and LOO, $\bar p_k$ is defined using some estimator $\hat\theta_k$ for each model $\mathcal{M}_k$, which maps a sequence $x^n$ of previous observations to an estimated parameter value that represents a "best guess" of the true/best distribution in the model. Prediction is then based on this estimator: $\bar p_k(X_{n+1} \mid x^n) = p_{\hat\theta_k(x^n)}(X_{n+1} \mid x^n)$, which also defines a joint density $\bar p_k(x^n) = \bar p_k(x_1) \cdots \bar p_k(x_n \mid x^{n-1})$. The Bayesian approach to model selection or model averaging goes the other way around. We start out with a prior $w_k$ on $\Theta_k$, and define the Bayesian marginal density

$\bar p_k(x^n) = \int_{\Theta_k} p_\theta(x^n)\, w_k(\theta)\, d\theta.$   (1)

When $\bar p_k(x^n)$ is non-zero this joint density induces a unique conditional density $\bar p_k(X_{n+1} \mid x^n) = \bar p_k(X_{n+1}, x^n)/\bar p_k(x^n)$, which is equal to the mixture of the $p_\theta \in \mathcal{M}_k$ according to the posterior $w_k(\theta \mid x^n) = p_\theta(x^n) w_k(\theta) / \int p_\theta(x^n) w_k(\theta)\, d\theta$ based on $x^n$. Thus the Bayesian approach also defines a prediction strategy $\bar p_k(X_{n+1} \mid x^n)$, whose corresponding distribution may be thought of as an estimator. From now on we sometimes call the distributions induced by $\bar p_1, \bar p_2, \ldots$ "estimators", even if they are Bayesian. This unified view is known as prequential or predictive MDL [11, 5].

Example 1. Suppose $\mathcal{X} = \{0, 1\}$. Then a prediction strategy $\bar p$ may be based on the Bernoulli model $\mathcal{M} = \{p_\theta \mid \theta \in [0, 1]\}$ that regards $X^\infty$ as a sequence of independent, identically distributed Bernoulli random variables with $P_\theta(X_{n+1} = 1) = \theta$. We may predict $X_{n+1}$ using the maximum likelihood (ML) estimator based on the past, i.e. using $\hat\theta(x^n) = n^{-1} \sum_{i=1}^{n} x_i$. The prediction for $x_1$ is then undefined. If we use a smoothed ML estimator such as the Laplace estimator, $\hat\theta'(x^n) = (n+2)^{-1}(\sum_{i=1}^{n} x_i + 1)$, then all predictions are well defined. Perhaps surprisingly, the predictor $\bar p'$ defined by $\bar p'(X_{n+1} \mid x^n) = p_{\hat\theta'(x^n)}(X_{n+1})$ equals the Bayesian predictive distribution based on a uniform prior. Thus in this case a Bayesian predictor and an estimation-based predictor coincide!
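This equivalence is easy to check numerically: chaining Laplace-rule predictions reproduces the Bernoulli marginal under the uniform prior, which has the closed form $n_1!\, n_0! / (n+1)!$ for a sequence with $n_1$ ones and $n_0$ zeros. A small sketch (ours):

```python
# Verify Example 1: sequential Laplace-rule predictions multiply out to the
# Bayesian marginal of the Bernoulli model under a uniform prior.
from math import factorial

def laplace_marginal(xs):
    prob, ones = 1.0, 0
    for n, x in enumerate(xs):
        p_one = (ones + 1) / (n + 2)          # Laplace estimator
        prob *= p_one if x == 1 else 1 - p_one
        ones += x
    return prob

def bayes_uniform_marginal(xs):
    n1 = sum(xs)
    n0 = len(xs) - n1
    return factorial(n1) * factorial(n0) / factorial(len(xs) + 1)

xs = [1, 0, 1, 1, 0]
assert abs(laplace_marginal(xs) - bayes_uniform_marginal(xs)) < 1e-12
```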
The Switch-Distribution  Suppose $p_1, p_2, \ldots$ is a list of prediction strategies for $X^\infty$. (Although here the list is infinitely long, the developments below can with little modification be adjusted to the case where the list is finite.) We first define a family $\mathcal{Q} = \{q_s : s \in \mathbb{S}\}$ of combinator prediction strategies that switch between the original prediction strategies. Here the parameter space $\mathbb{S}$ is defined as

$\mathbb{S} = \{((t_1, k_1), \ldots, (t_m, k_m)) \in (\mathbb{N} \times \mathbb{Z}^+)^m \mid m \in \mathbb{Z}^+,\ 0 = t_1 < \ldots < t_m\}.$   (2)

The parameter $s \in \mathbb{S}$ specifies the identities of $m$ constituent prediction strategies and the sample sizes, called switch-points, at which to switch between them. For $s = ((t_1', k_1'), \ldots, (t_m', k_m'))$, we define $t_i(s) = t_i'$, $k_i(s) = k_i'$ and $m(s) = m'$. We omit the argument when the parameter $s$ is clear from context, e.g. we write $t_3$ for $t_3(s)$. For each $s \in \mathbb{S}$ the corresponding $q_s \in \mathcal{Q}$ is defined as:

$q_s(X_{n+1} \mid x^n) = \begin{cases} p_{k_1}(X_{n+1} \mid x^n) & \text{if } n < t_2, \\ p_{k_2}(X_{n+1} \mid x^n) & \text{if } t_2 \le n < t_3, \\ \quad\vdots & \\ p_{k_{m-1}}(X_{n+1} \mid x^n) & \text{if } t_{m-1} \le n < t_m, \\ p_{k_m}(X_{n+1} \mid x^n) & \text{if } t_m \le n. \end{cases}$   (3)

Switching to the same predictor multiple times is allowed. The extra switch-point $t_1$ is included to simplify notation; we always take $t_1 = 0$. Now the switch-distribution is defined as a Bayesian mixture of the elements of $\mathcal{Q}$ according to a prior on $\mathbb{S}$:

Definition 1 (Switch-Distribution). Let $\pi$ be a probability mass function on $\mathbb{S}$. Then the switch-distribution $P_{\mathrm{sw}}$ with prior $\pi$ is the distribution for $X^\infty$ such that, for any $n \in \mathbb{Z}^+$, the density of its marginal distribution for $X^n$ is given by

$p_{\mathrm{sw}}(x^n) = \sum_{s \in \mathbb{S}} q_s(x^n) \cdot \pi(s).$   (4)

Although the switch-distribution provides a general way to combine prediction strategies, in this paper it will only be applied to combine prediction strategies $\bar p_1, \bar p_2, \ldots$ that correspond to models. In this case we may define a corresponding model selection criterion $\delta_{\mathrm{sw}}$. To this end, let $K_{n+1}: \mathbb{S} \to \mathbb{Z}^+$ be a random variable that denotes the strategy/model that is used to predict $X_{n+1}$ given past observations $x^n$. Formally, $K_{n+1}(s) = k_i(s)$ iff $t_i(s) \le n$ and either $i = m(s)$ or $n < t_{i+1}(s)$. Algorithm 1, given in Section 5, efficiently computes the posterior distribution on $K_{n+1}$ given $x^n$:

$\pi(K_{n+1} = k \mid x^n) = \frac{\sum_{\{s\,:\,K_{n+1}(s) = k\}} q_s(x^n)\, \pi(s)}{p_{\mathrm{sw}}(x^n)},$   (5)

which is defined whenever $p_{\mathrm{sw}}(x^n)$ is non-zero. We turn this into a model selection criterion $\delta_{\mathrm{sw}}(x^n) = \arg\max_k \pi(K_{n+1} = k \mid x^n)$ that selects the model with maximum posterior probability.
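To fix ideas, here is a direct (and deliberately brute-force) rendering of equations (3) and (4) over a finite subset of $\mathbb{S}$; it is a sketch of ours, not the efficient procedure, which is Algorithm 1 in Section 5. `strategies` is assumed to be a dict mapping predictor indices $k$ to callables for $p_k(x \mid \text{past})$, and a parameter $s$ is a list of $(t_i, k_i)$ pairs with $t_1 = 0$:

```python
def q_s(s, strategies, xs):
    """q_s(x^n) per (3): outcome x_{n+1} (0-based position n) is predicted
    by the constituent k_i whose segment contains n, i.e. the pair in s
    with the largest t_i <= n."""
    prob = 1.0
    for n, x in enumerate(xs):
        _, k = max((t_i, k_i) for (t_i, k_i) in s if t_i <= n)
        prob *= strategies[k](xs[:n], x)
    return prob

def p_sw(support, pi, strategies, xs):
    """p_sw(x^n) per (4), with the sum over S truncated to `support`;
    pi(s) is the prior mass of parameter s."""
    return sum(pi(s) * q_s(s, strategies, xs) for s in support)
```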
3 Consistency

If one of the models, say the one with index $k^*$, is actually true, then it is natural to ask whether $\delta_{\mathrm{sw}}$ is consistent, in the sense that it asymptotically selects $k^*$ with probability 1. Theorem 1 below states that this is the case under certain conditions which are only slightly stronger than those required for standard Bayes factor model selection consistency. Bayes factor model selection is consistent if for all $k$ and all $k' \ne k$, $\bar P_k(X^\infty)$ and $\bar P_{k'}(X^\infty)$ are mutually singular, that is, if there exists a measurable set $A \subseteq \mathcal{X}^\infty$ such that $\bar P_k(A) = 1$ and $\bar P_{k'}(A) = 0$ [3]. For example, this can usually be shown to hold if the models are nested and for each $k$, $\Theta_k$ is a subset of $\Theta_{k+1}$ of $w_{k+1}$-measure 0 [6]. For consistency of $\delta_{\mathrm{sw}}$, we need to strengthen this to the requirement that, for all $k' \ne k$ and all $x^n \in \mathcal{X}^*$, the distributions $\bar P_k(X_{n+1}^\infty \mid x^n)$ and $\bar P_{k'}(X_{n+1}^\infty \mid x^n)$ are mutually singular. For example, if $X_1, X_2, \ldots$ are i.i.d. according to each $P_\theta$ in all models, but also if $\mathcal{X}$ is countable and $\bar p_k(x^{n+1} \mid x^n) > 0$ for all $k$ and all $x^{n+1} \in \mathcal{X}^{n+1}$, then this conditional mutual singularity is automatically implied by ordinary mutual singularity of $\bar P_k(X^\infty)$ and $\bar P_{k'}(X^\infty)$.

Let $E_s = \{s' \in \mathbb{S} \mid m(s') > m(s),\ (t_i(s'), k_i(s')) = (t_i(s), k_i(s))\ \text{for}\ i = 1, \ldots, m(s)\}$ denote the set of all possible extensions of $s$ to more switch-points. Let $\bar p_1, \bar p_2, \ldots$ be Bayesian prediction strategies with respective parameter spaces $\Theta_1, \Theta_2, \ldots$ and priors $w_1, w_2, \ldots$, and let $\pi$ be the prior of the corresponding switch-distribution.

Theorem 1 (Consistency of the Switch-Distribution). Suppose $\pi$ is positive everywhere on $\{s \in \mathbb{S} \mid m(s) = 1\}$ and is such that there exists a positive constant $c$ such that, for every $s \in \mathbb{S}$, $c \cdot \pi(s) \ge \pi(E_s)$. Suppose further that $\bar P_k(X_{n+1}^\infty \mid x^n)$ and $\bar P_{k'}(X_{n+1}^\infty \mid x^n)$ are mutually singular for all $k, k' \in \mathbb{Z}^+$, $k \ne k'$, $x^n \in \mathcal{X}^*$. Then, for all $k^* \in \mathbb{Z}^+$ and all $\theta^* \in \Theta_{k^*}$ except for a subset of $\Theta_{k^*}$ of $w_{k^*}$-measure 0, the posterior distribution on $K_{n+1}$ satisfies

$\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\,n \to \infty\,} 1 \quad \text{with } P_{\theta^*}\text{-probability 1}.$   (6)

The requirement that $c \cdot \pi(s) \ge \pi(E_s)$ is automatically satisfied if $\pi$ is of the form

$\pi(s) = \pi_M(m)\, \pi_K(k_1) \prod_{i=2}^{m} \pi_T(t_i \mid t_i > t_{i-1})\, \pi_K(k_i),$   (7)

where $\pi_M$, $\pi_K$ and $\pi_T$ are priors on $\mathbb{Z}^+$ with full support, and $\pi_M$ is geometric: $\pi_M(m) = \theta^{m-1}(1-\theta)$ for some $0 < \theta < 1$. In this case $c = \theta/(1-\theta)$.

4 Optimal Risk Convergence Rates

Suppose $X_1, X_2, \ldots$ are distributed according to $P^*$. We define the risk at sample size $n \ge 1$ of the estimator $\bar P$ relative to $P^*$ as

$R_n(P^*, \bar P) = E_{X^{n-1} \sim P^*}\big[D\big(P^*(X_n = \cdot \mid X^{n-1}) \,\big\|\, \bar P(X_n = \cdot \mid X^{n-1})\big)\big],$

where $D(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence [4]. This is the standard definition of risk relative to KL divergence. The risk is always well defined, and equal to 0 iff $\bar P(X_{n+1} \mid X^n)$ is equal to $P^*(X_{n+1} \mid X^n)$. The following identity connects information-theoretic expected redundancy and accumulated statistical risk (see [4] or [6, Chapter 15]): if $P^*$ admits a density $p^*$, then for all prediction strategies $\bar p$,

$E_{X^n \sim P^*}\big[-\log \bar p(X^n) + \log p^*(X^n)\big] = \sum_{i=1}^{n} R_i(P^*, \bar P).$   (8)

For a union of parametric models $\mathcal{M} = \bigcup_{k \ge 1} \mathcal{M}_k$, we define the information closure $\mathcal{M}^* = \{P^* \mid \inf_{P \in \mathcal{M}} D(P^* \| P) = 0\}$, i.e. the set of distributions for $X^\infty$ that can be arbitrarily well approximated by elements of $\mathcal{M}$. Theorem 2 below shows that, for a very large class of $P^* \in \mathcal{M}^*$, the switch-distribution defined relative to estimators $\bar P_1, \bar P_2, \ldots$ achieves the same risk as any other model selection criterion defined with respect to the same estimators, up to lower order terms; in other words, model averaging based on the switch-distribution achieves at least the same rate of convergence as model selection based on any model selection criterion whatsoever (the issue of averaging vs. selection will be discussed at length in the full paper). The theorem requires that the prior $\pi$ in (4) is of the form (7) and satisfies

$-\log \pi_M(m) = O(m); \quad -\log \pi_K(k) = O(\log k); \quad -\log \pi_T(t) = O(\log t).$   (9)

Thus $\pi_M$, the prior on the total number of switch-points, is allowed to decrease either polynomially or exponentially (as required for Theorem 1); $\pi_T$ and $\pi_K$ must decrease polynomially. For example, we could set $\pi_T(t) = \pi_K(t) = 1/(t(t+1))$, or we could take the universal prior on the integers [12].

Let $\mathcal{M}' \subseteq \mathcal{M}^*$ be some subset of interest of the information closure of model $\mathcal{M}$. $\mathcal{M}'$ may consist of just a single, arbitrary distribution $P^*$ in $\mathcal{M}^* \setminus \mathcal{M}$, in which case Theorem 2 shows that the switch-distribution converges as fast as any other model selection criterion on any distribution in $\mathcal{M}^*$ that cannot be expressed parametrically relative to $\mathcal{M}$; or it may be a large, nonparametric family, in which case Theorem 2 shows that the switch-distribution achieves the minimax convergence rate. For example, if the models $\mathcal{M}_k$ are $k$-bin histograms [13], then $\mathcal{M}^*$ contains every distribution on $[0, 1]$ with a bounded continuous density, and we may, for example, take $\mathcal{M}'$ to be the set of all distributions on $[0, 1]$ which have a differentiable density $p^*$ such that $p^*(x)$ and $(d/dx)\,p^*(x)$ are bounded from below and above by some positive constants.

We restrict ourselves to model selection criteria which, at sample size $n$, never select a model $\mathcal{M}_k$ with $k > n^\alpha$, for some arbitrarily large but fixed $\alpha > 0$; note that this condition will be met for most practical model selection criteria.
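For concreteness, the following sketch (ours) evaluates one prior of the form (7) that meets both the condition of Theorem 1 (with $c = \theta/(1-\theta)$) and the rates (9), using the suggested $\pi_T(t) = \pi_K(t) = 1/(t(t+1))$. The conditional $\pi_T(t \mid t > t')$ then has the closed form $(t'+1)/(t(t+1))$, because the tail sum of $\pi_T$ from $t'+1$ onward telescopes to $1/(t'+1)$:

```python
# log pi(s) for a prior of the form (7), for s = [(t_1,k_1),...,(t_m,k_m)]
# with t_1 = 0 and strictly increasing switch-points.
from math import log

def log_pi(s, theta=0.5):
    m = len(s)
    lp = (m - 1) * log(theta) + log(1 - theta)   # geometric pi_M(m)
    k1 = s[0][1]
    lp -= log(k1 * (k1 + 1))                     # pi_K(k_1) = 1/(k(k+1))
    for (t_prev, _), (t, k) in zip(s, s[1:]):
        lp += log(t_prev + 1) - log(t * (t + 1))  # pi_T(t | t > t_prev)
        lp -= log(k * (k + 1))                    # pi_K(k)
    return lp
```

Note that $-\log \pi_K(k) = \log(k(k+1)) = O(\log k)$ and likewise for $\pi_T$, so (9) holds, while the geometric $\pi_M$ gives $-\log \pi_M(m) = O(m)$.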
Let $h: \mathbb{Z}^+ \to \mathbb{R}^+$ denote the minimax optimal achievable risk as a function of the sample size, i.e.

$h(n) = \inf_{\delta_n : \mathcal{X}^n \to \{1, 2, \ldots, \lceil n^\alpha \rceil\}}\ \sup_{P^* \in \mathcal{M}'}\ \sup_{n' \ge n} R_{n'}(P^*, \bar P_{\delta_n}),$   (10)

where the infimum is over all model selection criteria restricted to sample size $n$, and $\lceil \cdot \rceil$ denotes rounding up to the nearest integer. Here $\bar p_{\delta_n}$ is the prediction strategy satisfying, for all $n' \ge n$ and all $x^{n'} \in \mathcal{X}^{n'}$, $\bar p_{\delta_n}(X_{n'+1} \mid x^{n'}) := \bar p_{\delta_n(x^n)}(X_{n'+1} \mid x^{n'})$, i.e. at sample size $n$ it predicts $x_{n+1}$ using $\bar p_k$ for the $k = \delta_n(x^n)$ chosen by $\delta_n$, and it keeps predicting future $x_{n'+1}$ by this same $\bar p_k$. We call $h(n)$ the minimax optimal rate of convergence for model selection relative to data from $\mathcal{M}'$, model list $\mathcal{M}_1, \mathcal{M}_2, \ldots$, and estimators $\bar P_1, \bar P_2, \ldots$ The definition is slightly nonstandard, in that we require a second supremum over $n' \ge n$. This is needed because, as will be discussed in the full paper, it can sometimes happen that, for some $P^*$, some $k$ and some $n' > n$, $R_{n'}(P^*, \bar P_k) > R_n(P^*, \bar P_k)$ (see also [4, Section 7.1]). In cases where this cannot happen, such as regression with standard ML estimators, and in cases where, uniformly for all $k$, $\sup_{n' \ge n} R_{n'}(P^*, \bar P_k) - R_n(P^*, \bar P_k) = o(\sum_{i=1}^{n} h(i))$ (in the full paper we show that this holds for, for example, histogram density estimation), our Theorem 2 also implies minimax convergence in terms of the standard definition, without the $\sup_{n' \ge n}$. We expect that the $\sup_{n' \ge n}$ can be safely ignored for most "reasonable" models and estimators.

The requirement that $n h(n)/(\log n)^2 \to \infty$ will typically be satisfied whenever $\mathcal{M}^* \setminus \mathcal{M}$ is nonempty. Then $\mathcal{M}'$ contains $P^*$ that are "nonparametric" relative to the chosen sequence of models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ Thus, the problem should not be "too simple": we do not know whether the theorem holds in the parametric setting where $P^* \in \mathcal{M}_k$ for some $k$ on the list. Theorem 2 expresses that the accumulated risk of the switch-distribution, as $n$ increases, is not significantly larger than the accumulated risk of any other procedure. This "convergence in sum" has been considered before by, for example, [13, 4], and is compared to ordinary convergence in the full paper, where we will also give example applications of the theorem and further discuss (10). The proof works by bounding the expected redundancy of the switch-distribution, which, by (8), is identical to the accumulated risk. It is not clear whether similar techniques can be used to bound the individual risk.

Theorem 2. Define $P_{\mathrm{sw}}$ for some model class $\mathcal{M} = \bigcup_{k \ge 1} \mathcal{M}_k$ as in (4), where the prior $\pi$ satisfies (9). Let $\mathcal{M}'$ be a subset of $\mathcal{M}^*$ with minimax rate $h$ such that $n h(n)$ is increasing and $n h(n)/(\log n)^2 \to \infty$. Then

$\limsup_{n \to \infty} \frac{\sup_{P^* \in \mathcal{M}'} \sum_{i=1}^{n} R_i(P^*, P_{\mathrm{sw}})}{\sum_{i=1}^{n} h(i)} \le 1.$   (11)

5 Computing the Switch-Distribution

Algorithm 1 sequentially computes the posterior probability on predictors $p_1, p_2, \ldots$ It requires that $\pi$ is a prior of the form in (7), and that $\pi_M$ is geometric, as is also required for Theorem 1 and permitted in Theorem 2. The algorithm resembles FIXED-SHARE [7], but whereas FIXED-SHARE implicitly imposes a geometric distribution for $\pi_T$, we allow general priors by varying the shared weight with $n$. We do require slightly more space to cope with $\pi_M$.

Algorithm 1 SWITCH($x^N$)
  ($K$ is the number of experts; $\theta$ is as in the definition of $\pi_M$.)
  for $k = 1, \ldots, K$ do initialise $w^{\mathrm{a}}_k \leftarrow \theta \cdot \pi_K(k)$; $w^{\mathrm{b}}_k \leftarrow (1-\theta) \cdot \pi_K(k)$ od
  Report prior $\pi(K_1 = k) = w^{\mathrm{a}}_k + w^{\mathrm{b}}_k$ (a $K$-sized array)
  for $n = 1, \ldots, N$ do
    for $k = 1, \ldots, K$ do $w^{\mathrm{a}}_k \leftarrow w^{\mathrm{a}}_k \cdot p_k(x_n \mid x^{n-1})$; $w^{\mathrm{b}}_k \leftarrow w^{\mathrm{b}}_k \cdot p_k(x_n \mid x^{n-1})$ od  (loss update)
    pool $\leftarrow \pi_T(Z = n \mid Z \ge n) \cdot \sum_k w^{\mathrm{a}}_k$  (share update)
    for $k = 1, \ldots, K$ do
      $w^{\mathrm{a}}_k \leftarrow w^{\mathrm{a}}_k \cdot \pi_T(Z \ne n \mid Z \ge n) + \theta \cdot \mathrm{pool} \cdot \pi_K(k)$
      $w^{\mathrm{b}}_k \leftarrow w^{\mathrm{b}}_k + (1-\theta) \cdot \mathrm{pool} \cdot \pi_K(k)$
    od
    Report posterior $\pi(K_{n+1} = k \mid x^n) = (w^{\mathrm{a}}_k + w^{\mathrm{b}}_k) / \sum_{k'} (w^{\mathrm{a}}_{k'} + w^{\mathrm{b}}_{k'})$ (a $K$-sized array)
  od
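Below is a Python rendering of Algorithm 1. It is a sketch under our reconstruction of the pseudocode above, not the authors' code: `experts` is assumed to be a list of callables returning $p_k(x_n \mid x^{n-1})$, and the hazard $\pi_T(Z = n \mid Z \ge n)$ equals $1/(n+1)$ for the choice $\pi_T(t) = 1/(t(t+1))$ suggested in Section 4.

```python
import numpy as np

def switch_posteriors(xs, experts, theta=0.5, pi_K=None):
    """Yields the posterior pi(K_{n+1} = k | x^n) after each outcome, per
    Algorithm 1. For long sequences, renormalize w_a and w_b jointly (or
    work in log space) to avoid numeric underflow."""
    K = len(experts)
    if pi_K is None:
        pi_K = np.full(K, 1.0 / K)
    w_a = theta * pi_K              # weights still open to future switches
    w_b = (1.0 - theta) * pi_K      # weights that have stopped switching
    for n, x in enumerate(xs, start=1):
        pred = np.array([p(xs[:n - 1], x) for p in experts])
        w_a *= pred                 # loss update
        w_b *= pred
        hazard = 1.0 / (n + 1)      # pi_T(Z = n | Z >= n) for 1/(t(t+1))
        pool = hazard * w_a.sum()   # share update
        w_a = (1.0 - hazard) * w_a + theta * pool * pi_K
        w_b = w_b + (1.0 - theta) * pool * pi_K
        yield (w_a + w_b) / (w_a + w_b).sum()
```

Feeding it, for instance, the two Markov-chain predictors from the sketch in Section 1 should reproduce the switch-distribution behaviour of Figure 1, up to the choice of priors.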
This algorithm can be used to obtain fast convergence in the sense of Theorem 2, which can be extended to cope with a restriction to only the first $K$ experts. Theorem 1 can be extended to show consistency in this case as well. If $\pi_T(Z = n \mid Z \ge n)$ and $\pi_K(k)$ can be computed in constant time, then the running time is $\Theta(N \cdot K)$, which is of the same order as that of fast model selection criteria like AIC and BIC. We will explain this algorithm in more detail in a forthcoming publication.

Acknowledgements  We thank Y. Mansour, whose remark over lunch at COLT 2005 sparked off all this research. We thank P. Harremoës and W. Koolen for mathematical support. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

A Proofs

Proof of Theorem 1. Let $U_n = \{s \in \mathbb{S} \mid K_{n+1}(s) \ne k^*\}$ denote the set of "bad" parameters $s$ that select an incorrect model. It is sufficient to show that

$\lim_{n \to \infty} \frac{\sum_{s \in U_n} \pi(s)\, q_s(X^n)}{\sum_{s \in \mathbb{S}} \pi(s)\, q_s(X^n)} = 0 \quad \text{with } \bar P_{k^*}\text{-probability 1}.$   (12)

To see this, suppose the theorem is false. Then there exists a set $\Phi \subseteq \Theta_{k^*}$ with $w_{k^*}(\Phi) > 0$ such that (6) does not hold for any $\theta^* \in \Phi$. But then by definition of $\bar P_{k^*}$ we have a contradiction with (12). Now let $A = \{s \in \mathbb{S} : k_{m(s)}(s) \ne k^*\}$ denote the set of parameters that are bad for sufficiently large $n$. We observe that for each $s' \in U_n$ there exists at least one element $s \in A$ that uses the same sequence of switch-points and predictors on the first $n+1$ outcomes (this implies that $K_i(s) = K_i(s')$ for $i = 1, \ldots, n+1$) and has no switch-points beyond $n$ (i.e. $t_{m(s)} \le n$). Consequently, either $s' = s$ or $s' \in E_s$. Therefore

$\sum_{s' \in U_n} \pi(s')\, q_{s'}(x^n) \le \sum_{s \in A} \big(\pi(s) + \pi(E_s)\big)\, q_s(x^n) \le (1 + c) \sum_{s \in A} \pi(s)\, q_s(x^n).$   (13)

Defining the mixture $r(x^n) = \sum_{s \in A} \pi(s)\, q_s(x^n)$, we will show that

$\lim_{n \to \infty} \frac{r(X^n)}{\pi\big((0, k^*)\big) \cdot \bar p_{k^*}(X^n)} = 0 \quad \text{with } \bar P_{k^*}\text{-probability 1}.$   (14)

Using (13) and the fact that $\sum_{s \in \mathbb{S}} \pi(s)\, q_s(x^n) \ge \pi\big((0, k^*)\big) \cdot \bar p_{k^*}(x^n)$, this implies (12).

For all $s \in A$ and $x^{t_m(s)} \in \mathcal{X}^{t_m(s)}$, by definition $Q_s(X_{t_m+1}^\infty \mid x^{t_m})$ equals $\bar P_{k_m}(X_{t_m+1}^\infty \mid x^{t_m})$, which is mutually singular with $\bar P_{k^*}(X_{t_m+1}^\infty \mid x^{t_m})$ by assumption. If $\mathcal{X}$ is a separable metric space, which holds because $\mathcal{X} \subseteq \mathbb{R}^d$ for some $d \in \mathbb{Z}^+$, it can be shown that this conditional mutual singularity implies mutual singularity of $Q_s(X^\infty)$ and $\bar P_{k^*}(X^\infty)$. To see this for countable $\mathcal{X}$, let $B_{x^{t_m}}$ be any event such that $Q_s(B_{x^{t_m}} \mid x^{t_m}) = 1$ and $\bar P_{k^*}(B_{x^{t_m}} \mid x^{t_m}) = 0$. Then, for $B = \{y \in \mathcal{X}^\infty \mid y_{t_m+1}^\infty \in B_{y^{t_m}}\}$, we have that $Q_s(B) = 1$ and $\bar P_{k^*}(B) = 0$. In the uncountable case, however, $B$ may not be measurable. We omit the full proof, which was shown to us by P. Harremoës. Any countable mixture of distributions that are mutually singular with $\bar P_{k^*}$, in particular the mixture $R$ with density $r$, is mutually singular with $\bar P_{k^*}$. This implies (14) by Lemma 3.1 of [2], which says that for any two mutually singular distributions $R$ and $P$, the density ratio $r(X^n)/p(X^n)$ goes to 0 as $n \to \infty$ with $P$-probability 1.

Proof of Theorem 2. We will show that for every $\lambda > 1$,

$\sup_{P^* \in \mathcal{M}'} \sum_{i=1}^{n} R_i(P^*, P_{\mathrm{sw}}) \le \lambda \sum_{i=1}^{n} h(i) + \epsilon_{\lambda,n} \sum_{i=1}^{n} h(i),$   (15)

where $\epsilon_{\lambda,n} \to 0$ as $n \to \infty$, and $\epsilon_{\lambda,1}, \epsilon_{\lambda,2}, \ldots$ are fixed constants that only depend on $\lambda$, but not on the chosen subset $\mathcal{M}'$ of $\mathcal{M}^*$. Theorem 2 is a consequence of (15), which we will proceed to prove.
Let $\delta_n : \mathcal{X}^n \to \{1, \ldots, \lceil n^\alpha \rceil\}$ be a model selection criterion, restricted to samples of size $n$, that is minimax optimal, i.e. it achieves the infimum in (10). If such a $\delta_n$ does not exist, we take a $\delta_n$ that is almost minimax optimal in the sense that it achieves the infimum to within $h(n)/n$. For $j \ge 1$, let $t_j = \lceil \lambda^{j-1} \rceil - 1$. Fix an arbitrary $n > 0$ and let $m$ be the unique integer such that $t_m < n \le t_{m+1}$. We will first show that for arbitrary $x^n$, $p_{\mathrm{sw}}$ achieves redundancy not much worse than $q_s$ with $s = ((t_1, k_1), \ldots, (t_m, k_m))$, where $k_i = \delta_{t_i}(x^{t_i})$. Then we show that the redundancy of this $q_s$ is small enough for (15) to hold. To achieve this redundancy, it is sufficient to take only a logarithmic number $m - 1$ of switch-points: $m - 1 < \log_\lambda(n+1)$. Formally, we have, for some $c > 0$, uniformly for all $n$ and all $x^n \in \mathcal{X}^n$,

$-\log p_{\mathrm{sw}}(x^n) = -\log \sum_{s' \in \mathbb{S}} q_{s'}(x^n)\, \pi(s') \le -\log q_s(x^n) - \log \pi_M(m) - \sum_{j=1}^{m} \log \pi_T(t_j)\, \pi_K(k_j)$
$\le -\log q_s(x^n) + c \log(n+1) + c\, m (\alpha + 1) \log n = -\log q_s(x^n) + O((\log n)^2).$   (16)

Here the second inequality follows because of (9), and the final equality follows because $m \le \log_\lambda(n+1) + 1$. Now fix any $P^* \in \mathcal{M}'$. Since $P^* \in \mathcal{M}^*$, it must have some density $p^*$. Thus, applying (8), then (16), and then (8) again, we find that

$\sum_{i=1}^{n} R_i(P^*, P_{\mathrm{sw}}) = E_{X^n \sim P^*}\big[-\log p_{\mathrm{sw}}(X^n) + \log p^*(X^n)\big]$
$\le E_{X^n \sim P^*}\big[-\log q_s(X^n) + \log p^*(X^n)\big] + O((\log n)^2)$
$= \sum_{i=1}^{n} R_i(P^*, Q_s) + O((\log n)^2) = \sum_{j=1}^{m} \sum_{i = t_j + 1}^{\min\{t_{j+1},\, n\}} R_i(P^*, \bar P_{k_j}) + O((\log n)^2).$   (17)

For $i$ appearing in the second sum, with $t_j < i \le t_{j+1}$, we have $R_i(P^*, \bar P_{k_j}) \le \sup_{i' \ge t_j+1} R_{i'}(P^*, \bar P_{k_j}) = \sup_{i' \ge t_j+1} R_{i'}(P^*, \bar P_{\delta_{t_j}(x^{t_j})}) \le h(t_j + 1)$, so that

$R_i(P^*, \bar P_{k_j}) \le h(t_j + 1) = \frac{1}{t_j + 1} \cdot (t_j + 1)\, h(t_j + 1) \le \frac{1}{t_j + 1} \cdot i\, h(i) \le \frac{t_{j+1}}{t_j + 1}\, h(i) \le \lambda\, h(i),$

where the middle inequality follows because $n h(n)$ is increasing, as required by the theorem. Summing over $i$, we get $\sum_{j=1}^{m} \sum_{i=t_j+1}^{\min\{t_{j+1},\, n\}} R_i(P^*, \bar P_{k_j}) \le \lambda \sum_{i=1}^{n} h(i)$. Combining this with (17), it follows that $\sum_{i=1}^{n} R_i(P^*, P_{\mathrm{sw}}) \le \lambda \sum_{i=1}^{n} h(i) + O((\log n)^2)$. Because this holds for arbitrary $P^* \in \mathcal{M}'$ (with the constant in the $O$-notation not depending on $P^*$), (15) now follows by the requirement of Theorem 2 that $n h(n)/(\log n)^2 \to \infty$.

References

[1] H. Akaike. A new look at the statistical model identification. IEEE T. Automat. Contr., 19(6):716–723, 1974.
[2] A. Barron. Logically Smooth Density Estimation. PhD thesis, Stanford University, Stanford, CA, 1985.
[3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE T. Inform. Theory, 44(6):2743–2760, 1998.
[4] A. R. Barron. Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6, pages 27–52, 1998.
[5] A. P. Dawid. Statistical theory: The prequential approach. J. Roy. Stat. Soc. A, 147, Part 2:278–292, 1984.
[6] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.
[7] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.
[8] R. E. Kass and A. E. Raftery. Bayes factors. J. Am. Stat. Assoc., 90(430):773–795, 1995.
[9] K. Li. Asymptotic optimality of $C_p$, $C_L$, cross-validation and generalized cross-validation: Discrete index set. Ann. Stat., 15:958–975, 1987.
[10] C. Monteleoni and T. Jaakkola. Online learning of non-stationary sequences. In Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, 2004. MIT Press.
[11] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE T. Inform. Theory, IT-30(4):629–636, 1984.
[12] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[13] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE T. Inform. Theory, 38(2):315–323, 1992.
[14] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6(2):461–464, 1978.
[15] R. Shibata. Asymptotic mean efficiency of a selection of regression variables. Ann. I. Stat. Math., 35:415–423, 1983.
[16] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Stat. Soc. B, 39:44–47, 1977.
[17] P. Volf and F. Willems. Switching between two universal source coding algorithms. In Proceedings of the Data Compression Conference, Snowbird, Utah, pages 491–500, 1998.
[18] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35:247–282, 1999.
[19] Y. Yang. Can the strengths of AIC and BIC be shared? Biometrika, 92(4):937–950, 2005.
[20] Y. Yang. Model selection for nonparametric regression. Statistica Sinica, 9:475–499, 1999.