Continuous-Time Belief Propagation

Tal El-Hay1 TALE @ CS . HUJI . AC . IL Ido Cohn1 IDO COHN @ CS . HUJI . AC . IL Nir Friedman1 NIR @ CS . HUJI . AC . IL Raz Kupferman2 RAZ @ MATH . HUJI . AC . IL 1 School of Computer Science and Engineering, 2 Institute of Mathematics, Hebrew University, Jerusalem 91904, Israel

Abstract
Many temporal processes can be naturally modeled as a stochastic system that evolves continuously over time. The representation language of continuous-time Bayesian networks allows to succinctly describe multi-component continuous-time stochastic processes. A crucial element in applications of such models is inference. Here we introduce a variational approximation scheme, which is a natural extension of Belief Propagation for continuous-time processes. In this scheme, we view messages as inhomogeneous Markov processes over individual components. This leads to a relatively simple procedure that allows to easily incorporate adaptive ordinary differential equation (ODE) solvers to perform individual steps. We provide the theoretical foundations for the approximation, and show how it performs on a range of networks. Our results demonstrate that our method is quite accurate on singly connected networks, and provides close approximations in more complex ones.

the study of efficient computer representations, inference, and learning of complex continuous-time processes is still in its early stages. Continuous-time Bayesian networks (CTBNs) (Nodelman et al., 2002) provide a sparse representation of complex multi-component processes by describing how the dynamics of an individual component depends on the state of its neighbors. A major challenge is translating the structure of a CTBN to computational gains in inference problems--answering queries about the process from partial observations. As exact inference in a CTBN is exponential in the number of components, we have to resort to approximations. Broadly speaking, these fall into two main categories. The first category includes stochastic approximations (Fan & Shelton, 2008; El-Hay et al., 2008), which sample trajectories of the process. While these can be asymptotically exact, they can be computationally expensive and incur computational penalties when sampling rapidly evolving subprocesses. The second category of approximations includes variational methods. Nodelman et al. (2005) and Saria et al. (2007) developed an approach based on expectation propagation (Minka, 2001; Heskes & Zoeter, 2002), where the posterior distribution over a process is approximated by piecewise homogeneous factored processes. This involves an elaborate message passing scheme between the approximations for different components, and an adaptive procedure for determining how to segment each time interval. More recently, we introduced a mean-field approximation (Cohn et al., 2009), which uses factored inhomogeneous processes (Opper & Sanguinetti, 2007). This allowed us to build on the rich literature of adaptive ODE solvers. While the mean-field approximation provides a lower-bound on the likelihood, it suffers from the expected drawbacks when approximating highly coupled processes. Here we introduce a variational approximation that combines insights from both previous approaches for variational inference in CTBNs. Our approximation is a natural extension of the successful Bethe approximation (Yedidia et al., 2005) to CTBNs. Alternatively, it can be viewed

1. Introduction
The dynamics of many real-life processes are naturally modeled in terms of continuous-time stochastic processes, allowing for a wide range of time scales within the same process. Examples include biological sequence evolution(Felsenstein, 2004), computer systems (Xu & Shelton, 2008; Simma et al., 2008), and social networks (Fan & Shelton, 2009). While the mathematical foundations of continuous-time stochastic processes are well understood (Chung, 1960),
Appearing in Proceedings of the 27 th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Continuous-Time Belief Propagation

applying the approach of Nodelman et al where the segment length diminishes to zero. Our approximation finds a collection of inhomogeneous processes over subsets of components, which are constrained to be locally consistent over single components. We show that this approximation is often accurate on tree-networks, and provides good approximations for more complex networks. Importantly, the approximation scheme is simple and allows to easily exploit the large suites of computational tools offered in the field of ODEs.

that combines the conditional rate matrices as follows:  i|Pa  qx ,y i|u  i i i i|Pai = i qxi ,xi |ui   0 x,y = {i} x=y otherwise, (1)

qx,y

where x,y = {i|xi = yi }. This definition implies that changes occur one component at a time. Given a continuous-time Bayesian network, we would like to evaluate the likelihood of evidence, to compute the probability of various events given the evidence (e.g., that the state of the system at time t is x), and to compute conditional expectations (e.g., the expected amount of time Xi was in state xi ). Direct computations of these quantities involve matrix exponentials of the rate matrix Q, whose size is exponential in the number of components, making this approach infeasible beyond a modest number of components. We therefore have to resort to approximations.

2. Continuous-Time Bayesian Networks
Consider a d-component Markov process X (t) = (t) (t) (t) (X1 , X2 , . . . Xd ) with state space S = S1 × S2 × · · · × Sd . A notational convention: vectors are denoted by boldface symbols, e.g., X, and matrices are denoted by blackboard style characters, e.g., Q. The states in S are denoted by vectors of indexes, x = (x1 , . . . , xd ). We use indexes 1  i, j  d for enumerating components and X (t) and (t) Xi to denote the random variable describing the state of the process and its i'th component at time t. The dynamics of a time-homogeneous continuous-time Markov process are fully determined by the Markov transition function, px,y (t) = Pr(X (t+s) = y|X (s) = x), where time-homogeneity implies that the right-hand side does not depend on s. These dynamics are captured by a matrix Q--the rate matrix, with non-negative off-diagonal entries qx,y and diagonal entries qx,x = - y=x qx,y . The rate matrix is related to the transition function by d px,y (t) dt = qx,y .
t=0

3. A Variational Principle
Variational inference methods pose the inference task in terms of an optimization problem. The objective is to maximize a functional which lower-bounds the log probability of the evidence by introducing an auxiliary set of variational parameters (Wainwright & Jordan, 2008). Recently, we introduced a variational formulation of inference in continuous-time Markov processes. We start by reviewing the relevant results of Cohn et al. For convenience we restrict our treatment to an interval [0, T ] with boundary evidence X (0) = e0 and X (T ) = eT . The posterior distribution of a homogeneous Markov process given evidence e = {e0 , eT } on the two boundaries is a non-homogeneous Markov process. Such a process can be represented using a time varying rate matrix Q(t) that describe the instantaneous transition rates. However, such a representation is unwieldy, since as t approaches T the transition rates from x = eT to eT approach infinity. To circumvent the problem of unbounded values near the boundaries, we introduced marginal density sets which represent the posterior process in terms of uni-variate and joint pairwise distributions. More formally, if Pr denotes the posterior distribution, its marginal density set is the following family of continuous functions: µx (t) = Pr(X (t) = x) Pr(X (t) = x, X (t+h) = y) , x = y. h0 h (2) In addition to providing a bounded representation to the posterior, this representation allows to easily compute exx,y (t) = lim

The probability of being in state x at time t satisfies the master equation (Chung, 1960) d P r(X (t) = x) = dt qy,x P r(X (t) = y).
y

A continuous-time Bayesian network is a structured multicomponent continuous-time Markov process. It is defined by assigning each component i a set of components Pai  {1, . . . , d} \ {i}, which are its parents in the network (Nodelman et al., 2002). With each component i we i|Pa then associate a family of rate matrices Q·|ui i , with entries qxi ,yii|ui , that describe the rates of change of the i'th component given the state ui of the parents Pai . The dynamics of X (t) are defined by a rate matrix Q with entries qx,y
i|Pa

Continuous-Time Belief Propagation

pected sufficient statistics using numerical integration:
t t

and an entropy functional
T

E [Tx (t)] =
0

µx (s)ds,

E [Mx,y (t)] =
0

x,y (s)ds,

H() =
0 x y=x

x,y (t)[1+ln µx (t)-ln x,y (t)]dt .

where Tx (t) is the residence time in state x in the interval [0, t], and Mx,y (t) is the number of transitions from x to y in the same interval. Thus, this representation is analogous to sets of mean parameters that are employed in variational approximations over exponential families with a finite dimensional parametrization (Wainwright & Jordan, 2008; Koller & Friedman, 2009). Families of functions µ,  that satisfy (2) for some Pr, must satisfy self-consistent relations imposed by the master equation. Definition 3.1 : (Cohn et al., 2009) A family  = {µx (t), x,y (t) : 0  t  T } of continuous functions is a Markov-consistent density set if the following constraints are fulfilled: µx (t)  0,
x

To illustrate this principle, we can examine how to derive an exact inference procedure. We can find the optimum of F(; Q) by introducing Lagrange multipliers that enforce the consistency constraint, and then find the stationary point of the corresponding Lagrangian. Since we are dealing with a continuous-time formula, we need to use the Euler-Lagrange method . As we show, the maximum satisfies a system of differential equations: d x = - dt d µx = dt qx,y y
y

x (T ) = 1 x=eT µx (0) = 1 x=e0 y = x, x = 0, (3)

(y,x - x,y ),
y=x

µx (0) = 1, y = x, (y,x (t) - x,y (t)) .

x,y = µx qx,y

y , x

x,y (t)  0 d µx (t) = dt

where we omit the (t) argument for clarity. The auxiliary functions x (t) are Lagrange multipliers. These equations have a simple intuitive solution that involves backward integration of x (t), as we have a boundary condition at time T and x (t) does not depend on µx (t). This integration results in x (t) = Pr(eT |X (t) = x) Once we solve for x (t), we can forward integrate µx (t) from the boundary conditions at 0 to get the solution for µx and x,y . This analysis suggests that this system of ODEs is similar to forward-backward propagation, except that unlike classical forward propagation, here the forward propagation takes into account the backward messages to directly compute the posterior. Note that applying this exact solution to a multi-component process results in an exponential (in d) number of coupled differential equations.

y=x

and x,y (t) = 0 whenever µx (t) = 0. For convenience, we define x,x = - y=x x,y . The evidence at the boundaries impose additional constraints on potential posterior processes. Specifically, the representation  corresponding to the posterior distribution PQ (·|e0 , eT ) is in the set Me that contains Markov-consistent density sets {µx (t), x,y (t)}, that satisfy µx (0) = 1 x=e0 , µx (T ) = 1 x=eT and xy (T ) = 0 for all y = eT . In addition, since these sets are posteriors of a CTBN, they also change one component at a time, which implies that x,y (t) = 0 if |x,y | > 1. Using this representation, the variational formulation is reminiscent of similar formulations for discrete probabilistic models (Jordan et al., 1998). Theorem 3.2: (Cohn et al., 2009) Let Q be a rate matrix and e = (e0 , eT ) be states of X. Then ln PQ (eT |e0 ) = max F(; Q),
Me

4. Continuous-Time Expectation Propagation
Approximate variational inference procedures are derived by posing an optimization problem that is an approximate version of the original one. Different approximations differ in terms of whether they approximate the objectives, limit or relax the allowed set of solutions, or combine several such approaches. Here, we follow a strategy which is based on the approach of expectation propagation, in which the set of admissible solutions is extended to ones that are consistent only on the expectations of statistics of interest, and in addition, use an approximate functional.

where F(; Q) = E(; Q) + H(), is the free energy functional which is a sum of an average energy functional  
T

E(; Q) =
0 x

µx (t)qx,x +
y=x

x,y (t) ln qx,y  dt,

Continuous-Time Belief Propagation

~ Let Me be the set of locally consistent densities that correspond to evidence e The local consistency of   and  i does not imply that the distribution Pri (Xi ) is equal to the marginal distribution Pr (Xi ), as marginalization of a Markov process is not necessarily a Markov process. Rather, Pri is the projection of Pr (Xi ) to a Markov process with the matching expectations of E [Txi (t)] and E [Mxi ,yi (t)] (Koller & Friedman, 2009). Such locally consistent sets allow us to construct a tractable approximation to the variational optimization problem by introducing the continuous-time Bethe functional ~  F(~ ; Q) = Ei ( (i) ; Qi|Pai ) +
i 

Figure 1. A CTBN and a corresponding factor graph.

4.1. Approximate Optimization Problem To represent potential solutions, we follow methods used in recent approximate inference procedures that use factor graph representations (Yedidia et al., 2005; Koller & Friedman, 2009). Specifically, we keep only marginal density sets over smaller clusters of components. We start with definitions and notations. A factor graph is an undirected bipartite graph. One layer in the graph consists of component nodes that are labeled by component indexes. The second layer consists of clusters nodes A , where each cluster   A, is a subset of {1, . . . , d}. The edges in the graph are between a component node i to a cluster node  if and only if i  . Thus, the neighbors of  are N () = {i : i  } and the neighbors of i are N (i) = { : i  }. A factor graph is family preserving, with respect to a given CTBN, if there exists an assignment function (i) that maps components to clusters, such that for every i, we have that {i}  Pai  (i). We denote by A() the set of components i for which (i) = . From now on, we assume that we deal only with family preserving factor graphs. Example 4.1: Figure 1 shows a simple CTBN and a corresponding factor graph. In this specific factor graph, A(a) = {1, 2}, A(b) = {3} and A(c) = {4}. Given a factor graph, we use its structure to define an approximation for a distribution. Instead of describing the distribution over all the components, we use a family of ~ density sets  = { i : i = 1, . . . , d}  {  :   A}. A family of marginal density sets can be inconsistent. We do not require full consistency, but only consistency between neighboring nodes in the following sense. ~ Definition 4.2: A family of density sets  is said to be locally consistent if for all   A and all i  N () we have µi = µ |i where ( µ |i )xi =
x\i

H(  ) -
i

ci H( i )

where, for  = (i), Ei (  ; Qi|Pai ) =
T 0 x

µ (t)qxi ,xii|ui + x
y=x

i|Pa

 x,y (t) ln qxi ,yii|ui dt,

i|Pa

and ci = N (i) - 1 ensure that the total weight of sets containing component i sums up to 1. This functional is analogous to the well-known Bethe approximation for discrete models (Yedidia et al., 2005). Combining the two approximations the approximate optimization problem becomes:
~ ~  Me

~  max F(~ ; Q)

(6)

Once the optimal parameters are found, we can use the relevant marginal density set to answer queries. 4.2. Stationary Point Characterization To characterize the stationary points of the approximate optimization problem (6) we use again the Euler-Lagrange method, where we introduce Lagrange multiplier funcd tions to enforce the cluster-wise constraints, dt µ = x   y=x (y  ,x - x ,y  ) as well as the local consistency constraints defined in equations (4) and (5). Differentiating the Lagrangian, equating the derivatives to zero, and performing some algebra, which we omit for the lack of space, we obtain fixed-point equations that consist of the initial constraints and two classes of coupled equations. The first class consists of equations similar to (3), which refer to the dynamics within each cluster. To simplify the presentation, we introduce some definitions. Definition 4.3: Assume we are given a time-varying matrix function G(t), and boundary conditions x0 and xT . Define

µ \i ,xi ] [x

(4)

and [x\i , xi ] is the assignment to x composed from x\i and xi . Likewise,  i =   |i where (   |i )xi ,yi =
x\i  [x\i ,xi ],[x\i yi ] .

(5)

Continuous-Time Belief Propagation

the operator  = R(G, x0 , xT ) to return  = (µ, ), the unique solution of the following ODEs d x = - dt d µx = dt gx,y y ,
y

4.3. Gauge Transformation To overcome these numerical difficulties, we now derive an alternative characterization, which does not suffer from unbounded values. We start with a basic result. Proposition 4.4: Let G be a unnormalized rate matrix function, and let x (t) be a smooth positive vector-valued function, where x (t) > 0 in [0, T ). Let G to be the matrix function with
 gxy =

x (T ) = 1 x=xT µx (0) = 1 x=x0 x = 0, y = x.

(y,x - x,y ),
y=x

x,y = µx gx,y

y , x

Note that this set of equations is identical to (3), but replaces the constant rate matrix Q by a time varying matrix function G(t). Using this terminology, the first part of the fixed-point equations is   = R(G , e0 | , eT | ), where G (t) is the time-dependent matrix with entries
 gx ,y =  i|Pai (qx y |u )1 iA() · nii  i i i xi ,y 

gxy · x y d gxx - dt log x

y=x y = x.

(10)

Then, R(G, x0 , xT ) = R(G , x0 , xT ). Proof sketch: Let ,  satisfy the system of equations of Def. 4.3 with G. Define  =  · , and show that  ,  satisfy the same system of equations with G .

(7)

This result characterizes transformations of (8­9) that do x ,y = {i} not change the fixed point solutions for cluster density sets. i|Pai We seek transformations that reweigh the functions ni, so i x = y  iN () 1 iA() qxi xi |ui + nxi ,xi  that they remain bounded using the following result.  0 otherwise, Proposition 4.5: Assume G is a unnormalized rate matrix and ni, are time-dependent functions that originate from function such that gx,y (t) = 0 for all x, y, gx,y (t) is conthe Lagrange multipliers that enforce local consistency tinuously differentiable in [0, T ], and  = R(G, x0 , xT ). constraints, If (t) is a family of smooth functions satisfying x (T ) = ci d i 1 x=xT and dt x (T ) < 0 for x = xT , then xi yi nii = , xi = yi xi ,y µi i x N (i) x,y (t) x (t) (9) lim < , x = y i tT µx (t) y (t) xi xi i nxi ,xi = ci i . µxi and
N (i)

(8)

These equations together with, (4) and (5) form the second set of equations that couple different clusters. Equation (7) suggests that the matrix G plays the role of a rate matrix. Unlike Q, G is not guaranteed to be a rate matrix as its rows do not necessarily sum up to zero. Nonetheless, even though it is not a rate matrix, this system of equations has a unique solution that can be found using a backward-forward integration. Thus, since the matrix function G corresponds to a unique density set, we say that G is an unnormalized parametrization of the process P . At this point, it is tempting to proceed to construct a message passing algorithm based on this fixed point characterization in analogous manner to the developments of Yedidia et al. (2005) . However, we are faced with a problem. Note xi that limtT µxei = . Therefore, according to Equation i (9), when t approaches T , there exists some   N (i) for which ni, T ,i (t) approaches  as t  T . This implies that xi e a simple-minded message passing procedure is susceptible to unbounded values and numerical difficulties.

tT

lim

d x,x (t) - log x (t) µx (t) dt

< ,

x.

Example 4.6: One function that satisfies the conditions of Proposition 4.5 is x (t) = 1 - t/T, x = eT and eT (t) = 1.
i Using this result, we introduce weight functions xi  i ci /(ci +1) (as above) and define x = . iN () (xi ) Using these weight functions, we define mii = xi ,y d i i i nii ( xi )ci /(ci +1) and mxi ,xi = nii - cic+1 dt log xi . i xi ,y xi ,x yi ~ Now if we define the time-dependent matrix G with entries i

gx ,y = ~  i|Pai (qx y |u )1 iA() · mii  xi ,y  i i i
iN ()

(11) x ,y = {i} + mii xi ,x x = y  otherwise,

i|Pai 1 iA() qxi xi |ui

  0

Continuous-Time Belief Propagation

Algorithm 1 Continuous-Time Belief Propagation Initialize messages: for all  and all i  N () Choose    M e Compute  i using (14) Set mii = 1 xi = yi , mii = 0 xi ,x xi ,y repeat Choose a cluster : 1. i  N (), set mii using (15) xi ,y ~ 2. Update G using (11) ~ 3. Compute   from G using (12) 4. i  N () compute  i using (14) until convergence ~ then G = (G ) . By Proposition 4.4, ~   = R(G , e0 | , eT | ). (12)


where we write

i x y i i µi x i

=

i x y i i

µi xi

once for each .

Plugging the definition of mii and mii into (9) we get xi ,x xi ,y mii = xi ,y
N (i) i i xi yi xi i i µxi yi ci

,

xi = yi (13) .

mii xi ,x
N (i)

= ci

i xi xi d i - log xi i µxi dt

The algorithm is summarized in Algorithm 1. The implementation of these steps involve a few details. We start with the initialization of messages. The only free parameter is the initial values of   . To ensure that these initial choices are in M , we choose initial rates, and perform e computations to get a valid posterior for the clusters. Another degree of freedom is the order of cluster updates. We use a randomized strategy, choosing a cluster at random, and if one of its neighbors was updated since it was last chosen, we update it. The computation in Step 3, involves reverse integration followed by forward integration (as explained in Section 3) . We gain efficiency by using adaptive numerical integration procedures. Specifically, we use the Runge-Kutta-Fehlberg (4,5) method. This method chooses temporal evaluation points on the fly, and returns values at these points. The computations of Step 2 is done on demand only at the evaluation points. To allow efficient interpolation, we use a piecewise linear approximation of  whose boundary points are determined by the evaluation points that are chosen by the adaptive integrator. Finally, as might be expected, we do not have convergence guarantees. However, if the algorithm converges, the fixed point equations are satisfied, hence giving a stationary point (hopefully a local maximum) of problem (6).

If the preconditions of Proposition 4.5 are satisfied, the terms in (13) are bounded. Together (11)­(13) provide an alternative characterization of the fixed point(s) of the optimization problem. 4.4. Message Passing Scheme We now use the above characterization as justification for a message passing scheme, that if converged, will satisfy the fixed point equations. While (11) and (12) are readily transformed into assignments, (13) poses a challenge. We start by noting that (13) contains the terms µi i and x i xi ,yi . We can get these terms from   for any   N (i). Thus, for   N (i), we define µi = µ |i  i =   |i (14)

5. Experiments
We tested our method on three representative network structures: a directed tree, a directed toroid, and a bidirectional ring (Fig. 2). The tree network does not have any cycles. The toroid network has cycles, but these are fairly long, whereas the bidirectional ring has multiple short cycles. All networks are parametrized as dynamic Ising models, in which neighboring components prefer to be in the same state. Specifically, we use the conditional rates qxi ,yii|ui =  1 + exp -2yi 
i|Pa -1 jPai

xj

We view these as the messages from cluster  to the component i. At convergence, µi = µi for ,   N (i), but this is not true before convergence. Next, we rewrite (13) as an assignment  i i 1 xi yi xi   xi = yi  i   N (i) mii µi yi xi ,y xi  = mii = xi ,y i  xi xi d  i  - log xi - mii else,  xi ,x  dt  N (i) µi xi
=

where xj  {-1, 1},  is a coupling parameter and  is a rate parameter. Small values of  correspond to weak coupling, whereas  determines how fast components tend to switch states. For each experiment we set evidence at times 0 and 1 (Fig. 2, left panel). We compare the Bethe approximation to exact inference and mean-field (Cohn et al., 2009). We start by comparing the value of sufficient statistics (residence time and number of transitions of each component for each state of its parents) computed by each method. For example, for a particular choice of  and  , (Fig. 2 middle columns) we can see that the Bethe approximation is virtually exact on the tree model and the toroid, but has some bias on the bidirectional ring model. These scatter plots also shed light

(15)

Continuous-Time Belief Propagation Network Residence time Number of Transitions Estimated log-likelihood

Figure 2. Simulation results for a tree network (top row), a toroid network (middle), and a bidirectional chain (bottom). Left network structure and the evidence at start and end points; black is +1 and white is -1. Middle-left: scatter plot of expected conditional residence times for networks with  = 1,  = 8. Each point corresponds to a single statistic, the x-axis is the exact value and the y-axis is the approximate value. Middle-right: same for expected conditional transition numbers. Right: exact and approximations of log-likelihood as function of , the strength of coupling between components ( = 8).

on the nature of the difference between the two methods. Specifically, in the most likely scenario, two components switch from -1 to 1 near the beginning and the other two switch from 1 to -1 near the end, and so through most of the interval all the components are in the same state. The mean-field algorithm gives a uni-modal solution, focusing on the most likely scenario, resulting in zero residence time for the less likely states. These states are represented by the points close on the x-axis. The Bethe representation on the other hand can capture multiple scenarios. Another aspect of the approximation is the estimation of the likelihood. In Fig. 2 (right column) we compare these estimations as function of , the problem hardness. Again, we see that the Bethe approximation is essentially exact on the tree network, and provides close approximations in the two other networks. When we push  and  to extreme values we do see inaccuracies even in the tree network, showing that the algorithm is an approximation. While the ODE solvers used here allow adaptive integration error control, we do not have an a-priory control on the propagation of this error. To test this effect on overall accuracy, we repeated these experiments using standard grid refinement. Specifically, we computed integrals using uniformly spaced evaluation points and systematically halving integration intervals until no changes in the output were observed. Final results of these tests were practically

the same as those obtained using adaptive integration.

Figure 3. Run time vs. the number components in the three networks types ( = 1,  = 8).

Next, we examine how the algorithm scales with the number of components in the networks. In all three networks we see that the magnitude of relative error is essentially independent of the number of components (not shown). Fig. 3 shows that the run time scales linearly with the number of components. In harder networks the algorithm requires more iterations leading to slower convergence.

6. Discussion
Here, we introduce a message passing scheme that provides a variational approximation for CTBNs. This scheme is derived from an approximate free energy functional, based on the principles of expectation-propagation, where we require the posteriors on clusters to be locally consistent in

Continuous-Time Belief Propagation

terms of the Markovian projections on individual components. We show that stationary points of this optimization problem are the fixed points of a message passing algorithm, whose structure closely mirrors Belief Propagation. In contrast to Belief Propagation on discrete (atemporal) networks, our algorithm is not guaranteed to be exact on tree CTBNs. The source of inaccuracy is the projection of the marginal distributions over components onto Markov processes. While this projection looses information, our empirical results suggest that this approximation is relatively accurate. The works that are closest to ours are those of Nodelman et al. (2005) and Saria et al. (2007) which are also derived from an expectation-propagation energy functional. The main difference between the two methods is the structure of the approximation. Nodelman et al use a piecewise homogeneous representation, allowing them to represent the rate in each homogeneous segment by a constant (conditional) rate matrix. This, however, requires introducing machinery for deciding how to segment each component. As Saria et al show, this choice can have dramatic impact on the quality of the approximation and the running time. In contrast, our approach uses a (continuously) inhomogeneous representation, which is essentially the limit when segment sizes tend to zero. Surprisingly, rather than making the problem more complex, this choice simplifies the mathematics and also the implementation. In particular, our solution decouples the probabilistic issues (dependencies between components) and numerical issues (adaptive integration) and allows us to build on well-understood methods from numerical integration for efficient and adaptive selection of the number and placement of discretization points. Our results show how a careful choice of representations and operations over them can narrow the gap between inference methods in discrete and continuous-time graphical models. Our constructions can be naturally generalized to capture more complex dependencies using methods based on Generalized Belief Propagation (Yedidia et al., 2005).

Mean field variational approximation for continuoustime Bayesian networks. UAI, 2009. El-Hay, T., Friedman, N., and Kupferman, R. Gibbs sampling in factorized continuous-time markov processes. UAI, 2008. Fan, Y. and Shelton, C. R. Learning continuous-time social network dynamics. UAI, 2009. Fan, Y. and Shelton, C.R. Sampling for approximate inference in continuous time Bayesian networks. In AI & Math, 2008. Felsenstein, J. Inferring Phylogenies. Sinauer, 2004. Heskes, T. and Zoeter, O. Expectation propagation for approximate inference in dynamic Bayesian networks. UAI, 2002. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., and Saul, L. K. An introduction to variational approximations methods for graphical models. In Learning in Graphics Models. 1998. Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. 2009. Minka, T. P. Expectation propagation for approximate Bayesian inference. UAI, 2001. Nodelman, U., Shelton, C. R., and Koller, D. Continuous time Bayesian networks. UAI, 2002. Nodelman, U., Shelton, C.R., and Koller, D. Expectation propagation for continuous time Bayesian networks. UAI, 2005. Opper, M. and Sanguinetti, G. Variational inference for Markov jump processes. NIPS, 2007. Saria, S., Nodelman, U., and Koller, D. Reasoning at the right time granularity. UAI, 2007. Simma, A., Goldszmidt, M., MacCormick, M., Barham, M., Black, M., Isaacs, M., and Mortier, R. CT-NOR: Representing and reasoning about events in continuous time. UAI, 2008. Wainwright, M. J. and Jordan, M. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1:1­305, 2008. Xu, J. and Shelton, C. R. Continuous time Bayesian networks for host level network intrusion detection. ECML/PKDD, 2008. Yedidia, J.S., Freeman, W.T., and Weiss, Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Info. Theory, 51:2282­ 2312, 2005.

Acknowledgments
We thank Ofer Meshi and the anonymous reviewers for helpful remarks on previous versions of the manuscript. This research was supported in part by a grant from the Israel Science Foundation. Tal El-Hay is supported by the Eshkol fellowship from the Israeli Ministry of Science.

References
Chung, K.L. Markov chains with stationary transition probabilities. 1960. Cohn, I., El-Hay, T., Friedman, N., and Kupferman, R.