Modeling Interleaved Hidden Processes

Niels Landwehr (niels.landwehr@cs.kuleuven.be)
Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium

Abstract

Hidden Markov models assume that observations in time series data stem from some hidden process that can be compactly represented as a Markov chain. We generalize this model by assuming that the observed data stems from multiple hidden processes, whose outputs interleave to form the sequence of observations. Exact inference in this model is NP-hard. However, a tractable and effective inference algorithm is obtained by extending structured approximate inference methods used in factorial hidden Markov models. The proposed model is evaluated in an activity recognition domain, where multiple activities interleave and together generate a stream of sensor observations. It is shown to be more accurate than a standard hidden Markov model in this domain.

1. Introduction

Hidden Markov models (HMMs) are among the most popular approaches for modeling time series data, and have seen widespread application in areas such as speech recognition, bioinformatics, and robotics. They assume that the observed data stems from a hidden process which is stationary and Markov.
However, in some application domains this single-process model is not appropriate. Consider for instance a log of web server requests, and assume we have no definite knowledge about which request has been issued by which user (e.g. because of proxy use). Clearly, there is no single hidden Markov process that accounts for the sequence of observed requests. Instead, there are multiple processes, one per user, which interleave to generate the sequence of observations. Another example, and the main motivation for the work presented in this paper, is activity recognition: the task of inferring a user's current activity from a stream of dense sensor data. In many situations, users switch back and forth between multiple activities, which causes sensor observations associated with the individual activities to interleave in time. The specific scenario considered in this paper is that objects used in activities of daily living are equipped with small RFID tags, which are picked up by a wearable RFID reader whenever a user interacts with the object. The task is to infer the sequence of activities carried out given the observed object interactions. In the light of recent advances in RFID technology, which allow tags to be cheaply mass-produced and readers to be made wearable, such application scenarios are attracting increasing research interest from both academia and industry (Wang et al., 2007).

HMMs have been widely used in activity recognition: activities are modeled as hidden states that emit the object tags observed by the RFID reader (Patterson et al., 2005). This is an appropriate model if activities are atomic and carried out sequentially. In many domains, however, activities are hierarchically structured, as sets of basic activities can be grouped into high-level activities. High-level activities typically interleave in time as a user switches between them, as illustrated in Figure 1. In this example, a user is having breakfast, which consists of the high-level activities makeToast, makeJuice, and getNews with corresponding basic activities. The domain could be modeled with a standard HMM by "flattening" the three activities into one process with 7 states, but in this case part of the problem structure would be lost. Alternatively, we can model the activities as three different processes which interleave in time. This has the advantage of decoupling the transition dynamics within one high-level activity from the interleaving behavior, yielding a more concise representation with fewer parameters.

Figure 1. Interleaving in an activity recognition domain. Three high-level activities (makeToast, makeJuice, getNews) with corresponding basic activities are interleaved in time as a user switches between them. Different activities can produce identical sensor observations, and therefore neither the interleaving nor the actual activities are directly observable from the sensor data.

In this paper, we present a probabilistic model in which observations are generated by multiple, interleaved hidden processes. The hidden processes are stationary Markov chains, and the switching mechanism by which they interleave is again Markov. Although there exists a large body of related work, to the best of our knowledge this interleaved setting has not been addressed before. A simplified version has been discussed in (Batu et al., 2004); however, tractable inference algorithms are not explored in that work. Simplicial mixtures of Markov chains, which employ a generative semantics similar to latent Dirichlet allocation, also address a similar problem (Girolami & Kabán, 2003). However, they restrict the constituent processes to be Markov rather than hidden Markov. A further class of models assumes several hidden processes that run in parallel, and that observations stem from their joint state. Examples include factorial hidden Markov models (Ghahramani & Jordan, 1997), hidden Markov decision trees (Jordan et al., 1996), coupled hidden Markov models (Brand, 1997) and mixed hidden Markov models (Altman, 2007). In contrast to our approach, these models focus on factorizing complex state spaces into cross-products of simpler components, rather than on modeling interleaved processes. Another related technique is switching state-space models (SSSMs) (Ghahramani & Hinton, 1998), in which several processes run in parallel and an additional switch variable selects one active process from which the current observation is generated. SSSMs are different in that their processes run concurrently, while an interleaving of processes is characterized by the fact that an inactive process is stopped and only resumes when it becomes active again. This creates additional dependencies between processes which cannot be modeled in an SSSM. Finally, hierarchical hidden Markov models capture hierarchical structure within the hidden process that generates the observations (Fine et al., 1998). However, their component processes cannot interleave, and thus the model is not appropriate in our domain.

The next section introduces the proposed model more formally. Afterwards, we discuss the key problem of hidden state inference: given a sequence of observations, find the most likely configuration of hidden processes to have generated the data. Unfortunately, exact inference can be shown to be NP-hard; however, efficient structured approximate inference techniques can be applied (Section 3). Finally, the proposed technique is evaluated in an activity recognition domain, and shown to outperform standard HMM-based approaches (Section 4).
2. The Model

Let Y_1, ..., Y_T denote a sequence of observations, where the Y_t take on one of D discrete values. A hidden Markov model µ (Rabiner, 1989) defines a sequence X_1, ..., X_T of hidden state variables, with X_t ∈ {1, ..., K} and K the number of different states the hidden process can take on. To simplify notation, assume that there is a special start state 0 the process is in at time t = 0, that is, X_0 = 0. The first transition is from X_0 to X_1 ∈ {1, ..., K}, and afterwards the first output Y_1 is emitted. The HMM is characterized by initial state probabilities a_{0i} = P(X_1 = i | X_0 = 0), state transition probabilities a_{ij} = P(X_t = j | X_{t-1} = i) for t ≥ 2, and emission probabilities b_{il} = P(Y_t = l | X_t = i) for t ≥ 1. The joint distribution of observations Y = Y_1, ..., Y_T and hidden states X = X_1, ..., X_T is given by

    P(X, Y) = \prod_{t=1}^{T} P(X_t | X_{t-1}) P(Y_t | X_t).

We will also refer to X as the hidden process that generated the observations Y.

We propose a model for multiple, interleaved hidden processes. Intuitively, an additional switching process controls a token that is handed from process to process, and determines which of the processes is active at a particular point t in time. The active process transitions to a new state and outputs the observation Y_t, while all other processes remain "frozen" in time.

More formally, let µ_1, ..., µ_M be hidden Markov models with initial state probabilities a^{(m)}_{0i}, transition probabilities a^{(m)}_{ij} and emission probabilities b^{(m)}_{il}. For ease of notation, we assume the number of states K is identical for all µ_m, but the model trivially generalizes to processes with state spaces of different size. Let furthermore µ̄ be a Markov process with states {1, ..., M}, initial state probabilities d_{0i} and transition probabilities d_{ij}. Let Z_t denote a random variable representing the state of µ̄ at time t, and let S^{(m)}_t denote random variables representing the state of process µ_m at time t for 1 ≤ m ≤ M. Z_t ∈ {1, ..., M} determines the active process at time t, and we will refer to µ̄ as the switching process. At every step t in time, a new active process is sampled from µ̄ with probability P(Z_t = j | Z_{t-1} = i) = d_{ij}. Afterwards, the states of µ_1, ..., µ_M are updated according to

    P(S^{(m)}_t = j | S^{(m)}_{t-1} = i, Z_t = k) = a^{(m)}_{ij} if k = m, and \delta_{ij} if k ≠ m,    (1)

where \delta_{ii} = 1 and \delta_{ij} = 0 for i ≠ j. In other words, a process µ_m transitions to a new state with probability given by its transition matrix if it is active at time t, and stays in its old state otherwise. Finally, the probability of emitting symbol Y_t is

    P(Y_t = l | S^{(1)}_t = i_1, ..., S^{(M)}_t = i_M, Z_t = k) = b^{(k)}_{i_k l},    (2)

that is, it is given by the emission probability of the process that is active at time t. Let S_t = S^{(1)}_t, ..., S^{(M)}_t, Z = Z_1, ..., Z_T and S = S_1, ..., S_T. Then

    P(Z, S, Y) = \prod_{t=1}^{T} P(Z_t | Z_{t-1}) P(Y_t | S_t, Z_t) \prod_{m=1}^{M} P(S^{(m)}_t | S^{(m)}_{t-1}, Z_t).    (3)

We will refer to this model as an interleaved mixture of hidden Markov models. It is represented by the dynamic Bayesian network structure given in Figure 2 (left). The model is structurally related to a factorial hidden Markov model (Ghahramani & Jordan, 1997), shown in Figure 2 (right). However, the structure is extended by the additional chain of Z_t nodes that determine the currently active process. Although the structure is densely connected, the set of parameters is simply the union of the parameter sets of the constituent HMMs µ_1, ..., µ_M and the switching process µ̄.

Figure 2. Interleaved mixture of hidden Markov models (left) and factorial hidden Markov model (right) in dynamic Bayesian network notation (for M = 3).
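To make the factorization in Equation (3) concrete, the following sketch scores a complete configuration of hidden states and observations under the model. It is an illustrative reading of Equations (1) to (3), not an implementation from this paper; the array names and layout (d0 and d for the switching process, trans and emit for the constituent HMMs, with row 0 of each transition matrix acting as the special start state) are assumptions introduced here.

    import numpy as np

    def joint_log_prob(z, s, y, d0, d, trans, emit):
        """Log P(Z=z, S=s, Y=y) for an interleaved mixture of HMMs, cf. Equation (3).

        z    : (T,) active process index at each step, values in {0, ..., M-1}
        s    : (T, M) states of all M processes, values in {0, ..., K};
               a process that has not yet been active stays in the start state 0
        y    : (T,) observed symbols, values in {0, ..., D-1}
        d0   : (M,) initial distribution of the switching process
        d    : (M, M) transition matrix of the switching process
        trans: (M, K+1, K+1) transition matrices, row 0 = start state
        emit : (M, K+1, D) emission matrices (row 0 unused)
        """
        T, M = s.shape
        prev = np.zeros(M, dtype=int)        # all processes start in state 0
        logp = 0.0
        for t in range(T):
            k = z[t]
            # switching process: P(Z_t | Z_{t-1})
            logp += np.log(d0[k] if t == 0 else d[z[t - 1], k])
            # Equation (1): only the active chain moves, the others stay frozen
            for m in range(M):
                if m == k:
                    logp += np.log(trans[m, prev[m], s[t, m]])
                elif s[t, m] != prev[m]:
                    return -np.inf           # a frozen chain changed state: impossible
            # Equation (2): emission from the active process only
            logp += np.log(emit[k, s[t, k], y[t]])
            prev = s[t].copy()
        return logp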
The following alternative interpretation of the model can be given. Let z denote an interleaving (in general, we denote random variables with upper-case letters and their instantiations with lower-case letters), and let t^m_1, ..., t^m_{T_m} denote the sequence positions for which z_t = m. That is, Y_{µ_m} = Y_{t^m_1}, ..., Y_{t^m_{T_m}} is the projection of Y to the elements generated by µ_m, and S_{µ_m} = S^{(m)}_{t^m_1}, ..., S^{(m)}_{t^m_{T_m}} are the corresponding hidden state variables. It is easily verified that

    P(Z, S, Y) = P(Z) \prod_{m=1}^{M} P_{µ_m}(Y_{µ_m}, S_{µ_m}),

where P_{µ_m}(Y_{µ_m}, S_{µ_m}) is the joint distribution of hidden states S_{µ_m} and observations Y_{µ_m} in the original HMM µ_m. This reformulation gives rise to an intuitive approach for sampling Y: first sample an interleaving pattern z from µ̄, and afterwards sample Y_{µ_m} from µ_m for 1 ≤ m ≤ M.
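This two-stage view suggests a simple way to draw synthetic data from the model. The sketch below is a hypothetical illustration under the same assumed array layout as the previous example: it first samples the interleaving from the switching process and then lets the active process advance and emit at each step, which is equivalent to sampling each projected subsequence Y_{µ_m} from µ_m. It assumes each transition row is a proper distribution with zero mass on the start state 0.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_interleaved(T, d0, d, trans, emit):
        """Sample (z, s, y) from an interleaved mixture of HMMs (Section 2)."""
        M = d0.shape[0]
        # Stage 1: interleaving pattern z from the switching process
        z = np.empty(T, dtype=int)
        z[0] = rng.choice(M, p=d0)
        for t in range(1, T):
            z[t] = rng.choice(M, p=d[z[t - 1]])
        # Stage 2: at each step only the active process moves and emits
        state = np.zeros(M, dtype=int)       # all processes start in state 0
        s = np.zeros((T, M), dtype=int)
        y = np.empty(T, dtype=int)
        for t in range(T):
            k = z[t]
            state[k] = rng.choice(trans.shape[1], p=trans[k, state[k]])
            y[t] = rng.choice(emit.shape[2], p=emit[k, state[k]])
            s[t] = state                     # inactive processes stay frozen
        return z, s, y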
3. Inference and Parameter Estimation

A key task in the activity recognition domains we have in mind is hidden state inference: find

    (z*, s*) = argmax_{z, s} P(z, s | y)    (4)

for a given sequence y of observations. This involves simultaneously finding a segmentation of y into subsequences y_{µ_m} generated by µ_m (the z*), and the most likely hidden states for y_{µ_m} in µ_m (the s*).

3.1. Exact Inference

Two special cases of the problem are trivial. For M = 1, the model coincides with a hidden Markov model, and the Viterbi algorithm returns the most likely hidden states in time O(K^2 T). Moreover, if the output symbol sets of µ_1, ..., µ_M are disjoint, the interleaving z* is directly observable, and s* can be obtained by running M instances of Viterbi in time O(M K^2 T). The more interesting case of M ≥ 2 and non-disjoint output symbol sets is inherently more difficult due to its combinatorial nature: the M constituent chains are coupled via the switching process and the observations, and thus cannot be handled independently. Accordingly, exact graphical model inference (e.g. with the junction tree algorithm) applied to the model in Figure 2 (left) has cost exponential in M, because the cliques at Y_t are of size O(M). In fact, for general graphical model structures of this form there is no tractable inference algorithm available. However, the conditional distributions P(Y_t | S_t, Z_t) have a particularly simple form, which could make the problem easier. Unfortunately, this is not the case:

Theorem. Exact inference for interleaved mixtures of hidden Markov models is NP-hard.

The theorem is proved by reduction from the strongly NP-hard 3-partition problem (Garey & Johnson, 1975):

Problem (3-partition). Let S be a multiset of M = 3N positive integers. Is there a partition of S into subsets S_1, ..., S_N of size 3 each such that the sum over the integers in each subset is the same?

A detailed proof is omitted for lack of space. Intuitively, the relationship is that an interleaving of µ_1, ..., µ_M "partitions" a given sequence into the parts generated by the different processes (cf. Figure 1). Note that a key issue is the strong NP-hardness of 3-partition: the problem is NP-hard even if the numbers in the input are given in unary notation (or, equivalently, if the integers in S are polynomially bounded in M).

3.2. Approximate Inference

Approximate inference in graphical models has received much attention, and a variety of techniques are available. The simplest class of methods are Markov chain Monte Carlo (MCMC) approaches. In Gibbs sampling, for instance, iterative conditional resampling of random variables defines a Markov process whose stationary distribution (under certain conditions) is the conditional distribution in Equation (4). However, MCMC is not an effective inference method in our case, because the Markov process defined by the Gibbs sampler is not ergodic. There can be two state configurations with positive probability that cannot be transformed into each other by single-variable changes without passing through an invalid (probability zero) configuration, such as any configuration with S^{(m)}_{t-1} ≠ S^{(m)}_t but Z_t ≠ m. This effectively traps the Gibbs sampler in a subspace of all configurations and prevents MCMC convergence. The problem is that Gibbs sampling, by updating only one variable at a time, ignores the specific model structure. Instead, we have to resort to approximate inference methods that better exploit the model structure. Examples include structured variational approximations (Ghahramani & Jordan, 1997) and an iterative approximate inference method known as the chainwise Viterbi algorithm (Saul & Jordan, 1999). These algorithms are used in factorial HMMs for computing EM statistics and for hidden state inference. In the rest of the section, we present an extension of chainwise Viterbi for solving the problem given by Equation (4).

The idea behind chainwise Viterbi is to repeatedly solve tractable sub-problems of the (intractable) global optimization problem. For factorial hidden Markov models, the natural sub-problem to solve is to optimize the hidden states in one chain S^{(m)} = S^{(m)}_1, ..., S^{(m)}_T conditioned on the current states of the other chains:

    s^{(m)}_{new} = argmax_{s^{(m)}} P(s^{(m)} | {s^{(l)} : l ≠ m}, y) = argmax_{s^{(m)}} P(s^{(1)}, ..., s^{(M)}, y).

In the dynamic Bayesian network representing an interleaved mixture of HMMs (Figure 2, left), there are two different types of hidden chains: the chains S^{(1)}, ..., S^{(M)} representing the constituent processes µ_1, ..., µ_M, and the chain Z representing the switching process µ̄. Assume first that Z is kept fixed, and the goal is to conditionally optimize a chain S^{(m)}. This is straightforward: for a given interleaving pattern, the chains S^{(1)}, ..., S^{(M)} become independent given Z and Y due to the special form of the conditional distributions P(Y_t | S_t, Z_t), cf. Equation (2). They can thus be optimized independently with standard Viterbi.
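As a concrete reading of this observation, the sketch below conditions on a fixed interleaving and runs standard Viterbi on each projected subsequence. It is an assumed illustration in the same array layout as the earlier sketches, not code from the paper.

    import numpy as np

    def viterbi(obs, trans, emit):
        """Standard Viterbi for one HMM with start state 0 (row 0 of trans)."""
        T = len(obs)
        K1 = trans.shape[0]                       # K states plus the start state
        with np.errstate(divide="ignore"):
            lt, le = np.log(trans), np.log(emit)
        delta = np.full((T, K1), -np.inf)
        back = np.zeros((T, K1), dtype=int)
        delta[0] = lt[0] + le[:, obs[0]]          # first transition leaves state 0
        for t in range(1, T):
            scores = delta[t - 1][:, None] + lt   # scores[i', i] for i' -> i
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + le[:, obs[t]]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    def optimize_chains_given_z(z, y, trans, emit, M):
        """For a fixed interleaving z the constituent chains decouple:
        run Viterbi on each projected subsequence y_{mu_m} separately."""
        states = {}
        for m in range(M):
            pos = np.where(z == m)[0]
            if len(pos) > 0:
                states[m] = dict(zip(pos.tolist(), viterbi(y[pos], trans[m], emit[m])))
        return states      # states[m][t] = most likely state of process m at time t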
We therefore focus on the task of optimizing Z given S^{(1)}, ..., S^{(M)}. A straightforward update

    z_{new} = argmax_z P(z, s^{(1)}, ..., s^{(M)}, y)

is not very effective: as a process µ_m can only change state at time t if it is active, we know from S^{(m)}_t ≠ S^{(m)}_{t-1} that Z_t = m. Thus, the joint state of S^{(1)}, ..., S^{(M)} essentially determines Z. To change the state of Z_t from m to n, it is necessary to also update S^{(m)}_t and S^{(n)}_t to reflect that µ_n is now active at time t. The solution is to jointly optimize two constituent chains S^{(m)}, S^{(n)} and the switching chain Z by

    (s^{(m)}_{new}, s^{(n)}_{new}, z_{new}) = argmax_{s^{(m)}, s^{(n)}, z} P(s, y, z).    (5)

Intuitively speaking, this update makes it possible to re-assign observations that have so far been attributed to process µ_m to process µ_n, by changing some Z_t from m to n and updating S^{(m)} and S^{(n)} accordingly. If it is repeatedly applied with different process indices m, n, the interleaving can be arbitrarily revised. Algorithm 1 describes this chainwise update scheme in pseudocode. The method consistent-configuration(M) initializes the states of the hidden variables to some positive-probability configuration (this is trivial if observation probabilities are always non-zero, as e.g. in Laplace-smoothed models). When choosing m, n ∈ {1, ..., M}, different strategies are possible; we assume the algorithm repeatedly cycles through all pairs n ≠ m. If the update step (5) is implemented exactly, P(s, z, y) will increase unless the hidden state configuration is left unchanged. Thus, the algorithm will always converge (though not necessarily to the true global optimum).

Algorithm 1 Chainwise Viterbi for interleaved mixtures of hidden Markov models
    Input: model M, observations Y
    (S, Z) := consistent-configuration(M)
    while not converged do
        choose m, n ∈ {1, ..., M}, m ≠ n
        (S^{(m)}, S^{(n)}, Z) := argmax_{S^{(m)}, S^{(n)}, Z} P(S, Z, Y)
    end while
    return S, Z

An efficient implementation of the update step (5) is crucial for fast inference. This can be achieved by dynamic programming in the spirit of the Viterbi algorithm (Rabiner, 1989). Moreover, the particularly restrictive form of the model (basically, that only the active chain changes state at any point in time) can be exploited. This allows much faster inference than for general graphical models with the DAG structure given in Figure 2 (left), as will be briefly outlined now. To simplify notation, assume that n = 1 and m = 2. In analogy to the Viterbi algorithm, define

    \delta_{ijk}[t] = max_D P(D, S^{(1)}_t = i, S^{(2)}_t = j, Z_t = k, Y, S^{(3)}, ..., S^{(M)})

with D = {S^{(1)}_1, ..., S^{(1)}_{t-1}, S^{(2)}_1, ..., S^{(2)}_{t-1}, Z_1, ..., Z_{t-1}}. Initialization of \delta_{ijk}[1] is straightforward. For the recursive definition of \delta_{ijk}[t], let

    C[k] = \prod_{m=3}^{M} P(S^{(m)}_t = s^{(m)}_t | S^{(m)}_{t-1} = s^{(m)}_{t-1}, Z_t = k),

where the s^{(m)}_t, s^{(m)}_{t-1} for m ≥ 3 denote the current values of the fixed chains µ_3, ..., µ_M. Now two cases have to be considered. If k ≥ 3, chains 1 and 2 cannot have changed state, and

    \delta_{ijk}[t] = max_{k' = 1, ..., M} \delta_{ijk'}[t-1] d_{k'k} b^{(k)}_{sy} C[k]

with s = s^{(k)}_t and y = Y_t. This quantity can be computed in time O(M). If k ∈ {1, 2}, we have to take into account state changes on the chains being optimized. Assume without loss of generality that k = 1. Now

    \delta_{ij1}[t] = max_{k' = 1, ..., M} max_{i' = 1, ..., K} \delta_{i'jk'}[t-1] d_{k'1} a^{(1)}_{i'i} b^{(1)}_{iy} C[1],

with y = Y_t. This quantity can be computed in time O(KM). There are O(K^2 M T) values of the form \delta_{ijk}[t] to compute. However, the time for computing all values is bounded by O(K^2 M (M + K) T), as the case k ∈ {1, 2} only appears O(K^2 T) times. The maximum probability of a hidden state configuration is

    max_{s^{(1)}, s^{(2)}, z} P(s, z, y) = max_{ijk} \delta_{ijk}[T],

and a maximizing configuration is found by keeping track of where the maxima occur in backtrace variables.

It is instructive to compare the complexity of the outlined chainwise Viterbi algorithm to inference in an HMM where the hidden states are "flattened" into a single process. This HMM has a state space of size KM, and standard Viterbi thus has complexity O(K^2 M^2 T), similar to the O(K^2 M (M + K) T) for a single update step in chainwise Viterbi. However, several such update steps will be needed before convergence.
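For illustration, the sketch below instantiates the \delta_{ijk}[t] recursion for the special case M = 2, where there are no frozen chains and the factor C[k] drops out; in that case the single joint update over (S^{(1)}, S^{(2)}, Z) is already exact inference. It is an assumed, simplified rendering in the array layout used in the earlier sketches, not the author's implementation; for M > 2 the same recursion applies with C[k] scoring the frozen chains and the outer loop of Algorithm 1 cycling over chain pairs.

    import numpy as np

    def exact_interleaved_viterbi_m2(y, d0, d, trans, emit):
        """delta_{ijk}[t] recursion of Section 3.2 for M = 2 constituent chains."""
        T = len(y)
        K = trans.shape[1] - 1                 # states 1..K plus start state 0
        with np.errstate(divide="ignore"):
            ld0, ld = np.log(d0), np.log(d)
            lt, le = np.log(trans), np.log(emit)
        NEG = -np.inf
        delta = np.full((T, K + 1, K + 1, 2), NEG)
        back = np.zeros((T, K + 1, K + 1, 2, 3), dtype=int)   # predecessor (i', j', k')

        # t = 0: the active chain leaves the start state, the other stays in it
        for i in range(1, K + 1):
            delta[0, i, 0, 0] = ld0[0] + lt[0, 0, i] + le[0, i, y[0]]
        for j in range(1, K + 1):
            delta[0, 0, j, 1] = ld0[1] + lt[1, 0, j] + le[1, j, y[0]]

        for t in range(1, T):
            for i in range(K + 1):
                for j in range(K + 1):
                    # k = 0: chain 1 moves i' -> i, chain 2 stays in j
                    best, arg = NEG, (0, 0, 0)
                    for kp in range(2):
                        for ip in range(K + 1):
                            v = delta[t - 1, ip, j, kp] + ld[kp, 0] + lt[0, ip, i]
                            if v > best:
                                best, arg = v, (ip, j, kp)
                    delta[t, i, j, 0] = best + le[0, i, y[t]]
                    back[t, i, j, 0] = arg
                    # k = 1: chain 2 moves j' -> j, chain 1 stays in i
                    best, arg = NEG, (0, 0, 0)
                    for kp in range(2):
                        for jp in range(K + 1):
                            v = delta[t - 1, i, jp, kp] + ld[kp, 1] + lt[1, jp, j]
                            if v > best:
                                best, arg = v, (i, jp, kp)
                    delta[t, i, j, 1] = best + le[1, j, y[t]]
                    back[t, i, j, 1] = arg

        # backtrace from the best final cell
        i, j, k = np.unravel_index(np.argmax(delta[T - 1]), delta[T - 1].shape)
        z, s1, s2 = [int(k)], [int(i)], [int(j)]
        for t in range(T - 1, 0, -1):
            i, j, k = back[t, i, j, k]
            z.append(int(k)); s1.append(int(i)); s2.append(int(j))
        return z[::-1], s1[::-1], s2[::-1]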
3.3. Parameter Estimation

There are different possible settings for learning the proposed model from data. In the activity recognition setting discussed in Section 4, both sensor observations and activities are given for the training set. In this fully observable case, maximum-likelihood model parameters can essentially be determined by counting. More generally, if the interleaving is known for the training data (that is, we know which part of each sequence has been generated by which process), the problem reduces to independently estimating the parameters of µ_1, ..., µ_M with the standard Baum-Welch algorithm (Rabiner, 1989). In an unsupervised learning setting, expectation-maximization including the unknown interleaving Z is a natural choice. However, for the same reasons as discussed in Section 3, exact computation of the expectation step will be infeasible. In factorial hidden Markov models, this problem is solved elegantly by a structured variational approximation, and exploring variational inference methods for the interleaved mixture model presented in this paper is an interesting direction for future work. A simple alternative is to employ hard EM: instead of computing exact expectations, hidden states are set to their maximum-likelihood values given the observations, and expectations are determined by counting. Together with the chainwise Viterbi algorithm discussed in Section 3.2, this yields a tractable method which is straightforward to implement.
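A minimal sketch of the counting estimator for the fully observed case is given below, with Laplace smoothing as mentioned above. It assumes training data in the (z, s, y) form and the array layout used in the earlier sketches; none of the names are from the paper.

    import numpy as np

    def estimate_by_counting(sequences, M, K, D, alpha=1.0):
        """Maximum-likelihood estimation by counting (Section 3.3), fully observed
        case, with Laplace smoothing alpha. `sequences` is a list of (z, s, y)."""
        d0 = np.full(M, alpha)
        d = np.full((M, M), alpha)
        trans = np.full((M, K + 1, K + 1), alpha)
        emit = np.full((M, K + 1, D), alpha)
        for z, s, y in sequences:
            d0[z[0]] += 1
            prev_state = np.zeros(M, dtype=int)       # every process starts in state 0
            for t in range(len(y)):
                k = z[t]
                if t > 0:
                    d[z[t - 1], k] += 1
                trans[k, prev_state[k], s[t, k]] += 1  # only the active chain moved
                emit[k, s[t, k], y[t]] += 1
                prev_state[k] = s[t, k]
        trans[:, :, 0] = 0                             # no transitions back into state 0
        # normalize counts into distributions
        d0 /= d0.sum()
        d /= d.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
        emit /= emit.sum(axis=2, keepdims=True)
        return d0, d, trans, emit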
4. Experimental Evaluation

The proposed model has been evaluated in an activity of daily living (ADL) recognition domain, where the goal is to infer a user's activity from a stream of dense RFID sensor data. The dataset was collected in a real RFID environment at Intel Research Seattle (Landwehr et al., 2007). Objects are equipped with small RFID tags, and the user wears a lightweight RFID reader in a bracelet around the wrist. Whenever the reader comes close (10-15 centimeters) to a tagged object, the object tag is recorded. The sequence of observed tags thus indicates the objects a user has been interacting with while performing the activity. We recorded activities involved in making breakfast at home, as this domain showcases the kind of interleaving behavior we are interested in (cf. Figure 1). The dataset consists of 20 sequences of RFID tag observations collected from 5 different persons having breakfast. Sequences are hand-labeled with the true current activity by a human observer. There are 18 basic activities organized into 6 high-level activities, 24 different classes of tagged objects (including nil if no object was observed), and a total of 4597 timepoints to be classified. Timepoints at which no activity is taking place and activities with a coverage of less than 1% were removed, leaving 14 activities and 3545 timepoints in the dataset. The average number of segments into which a high-level activity is broken up because of interleaving is 3.95. There is significant overlap between the observations associated with different activities, either because the same object is used in different activities or because of noise in the sensor data. More specifically, the average overlap in the set of observations associated with two different activities is 40.6%.

A standard approach in ADL recognition is based on HMMs: each basic activity corresponds to a hidden state, and sensor data to observations. In the described domain this means that all activities are "flattened" into one hidden process, and their hierarchical structure is lost. This approach will serve as a baseline, denoted by HMM. Alternatively, high-level activities can be modeled as separate hidden processes using the model described in Section 2. Here we consider a slight extension of this model: state transition probabilities in the active hidden process µ_{Z_t} depend not only on the previous state but also on whether or not the process has just become active, that is, on whether Z_t ≠ Z_{t-1}. The motivation for this extension is that high-level activities are typically interrupted at a point where the basic activity changes as well. It is straightforward to generalize the model and algorithms discussed in Section 2 and Section 3 to include this dependency. Each high-level activity A is represented as a process µ_A, and the state space of µ_A consists of the basic activities associated with A. Note that the method, when applied to a given observation sequence, will automatically choose the (approximately) most likely subset of high-level activities that explains the observations. This model, together with the approximate inference technique discussed in Section 3.2, will be denoted as HMMmix. In the chainwise Viterbi algorithm, hidden states are initialized to the most likely activity given the current sensor observation (as observed in the training data). Furthermore, a version with exact inference (denoted HMMmix*) is run for comparison.

The experimental study seeks to answer the following two questions:

(Q1) Does reconstruction accuracy increase if high-level activities are modeled as separate processes?

(Q2) Does the approximate inference algorithm for HMMmix yield results similar to exact inference?

The rationale behind (Q1) is that modeling high-level activities as separate processes will capture the transition dynamics more concisely, as it decouples the dynamics within a high-level activity from the switching dynamics. This is reflected in the number of model parameters: the "flattened" HMM representation requires O((MK)^2) = O(M^2 K^2) parameters to specify the transition dynamics, while HMMmix only requires O(M^2 + M K^2) parameters.

To evaluate the different approaches, we performed a leave-one-sequence-out cross-validation. On the respective training set, models are estimated from fully observable training data, i.e., information on both sensor observations and activities is available. Given a test sequence, the most likely joint state of the hidden variables in the model is determined, yielding a prediction of the current basic activity at every point in time. This is compared against the known true activity, and the average prediction accuracy is computed.

Table 1. Average cross-validated accuracy for Majority, Majority/Observation, HMM, HMMmix and HMMmix* on the ADL dataset. A dot (·) indicates that the result for HMMmix is significantly better than the result for the other method (paired two-sided t-test, p = 0.05).

    Method                   Accuracy
    Majority                 21.2 ± 25.4 ·
    Majority/Observation     71.4 ± 10.3 ·
    HMM                      84.0 ±  9.8 ·
    HMMmix                   86.0 ±  8.6
    HMMmix*                  86.0 ±  8.6

Table 1 shows the reconstruction accuracy for HMM, HMMmix and HMMmix*. Additionally, the accuracy for always predicting the most frequent activity (Majority), and for predicting the most frequent activity given a particular sensor observation (Majority/Observation), is shown. HMMmix significantly outperforms HMM (paired two-sided t-test, p = 0.05), and the predictions made by HMMmix and HMMmix* are identical in this experiment. This affirmatively answers questions Q1 and Q2.
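The evaluation protocol described above can be summarized as in the following sketch. It is a hypothetical outline only: `fit` and `predict_states` stand in for parameter estimation (e.g. the counting estimator of Section 3.3) and hidden state inference (e.g. chainwise Viterbi), and the (z, s, y) data layout follows the earlier sketches rather than the actual dataset format.

    import numpy as np

    def leave_one_sequence_out(sequences, fit, predict_states):
        """Leave-one-sequence-out protocol of Section 4 (hypothetical helpers).

        sequences      : list of (z, s, y) triples with ground-truth labels
        fit            : stand-in for parameter estimation on the training fold
        predict_states : stand-in for hidden state inference, returning one
                         basic-activity prediction per timepoint of y
        """
        accuracies = []
        for i in range(len(sequences)):
            train = sequences[:i] + sequences[i + 1:]
            z_true, s_true, y = sequences[i]
            model = fit(train)
            pred = np.asarray(predict_states(model, y))
            # the true basic activity is the state of the currently active process
            truth = s_true[np.arange(len(y)), z_true]
            accuracies.append(float(np.mean(pred == truth)))
        return float(np.mean(accuracies))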
Figure 3 shows the convergence behavior of chainwise Viterbi. The normalized log-likelihood of the current configuration of hidden states and the reconstruction accuracy given by this configuration are plotted as a function of the algorithm iteration. As expected, both likelihood and accuracy increase as the algorithm repeatedly revises the current interleaving. Furthermore, convergence occurs after a small number of iterations.

Figure 3. Normalized log-likelihood and reconstruction accuracy of the current hidden state configuration as a function of the number of iterations in chainwise Viterbi. Results are averaged over all test sequences.

There are two sources of information for predicting the activity at a point t in time: the current sensor observation, and the transition dynamics for activities (which capture the influence of past and future observations on the current prediction). The Majority/Observation approach already performs well; this indicates that much information is obtained simply from the current sensor observation. To further investigate the influence of the transition dynamics on reconstruction accuracy, the following experiment was carried out. When estimating a model from data, only a randomly selected fraction of the training sequences is used to estimate transition probabilities, while all available data is used to estimate emission probabilities. Figure 4 shows reconstruction accuracy as a function of this fraction. The experiment confirms that HMMmix outperforms HMM, and that approximate inference gives solutions very close to those of exact inference (the solutions differ slightly, but the curves for HMMmix and HMMmix* in Figure 4 are indistinguishable). Moreover, the difference between HMM and HMMmix is most pronounced if only 20% to 40% of the training sequences are used to estimate transition parameters. This supports the hypothesis that the more concise representation of transition dynamics in HMMmix (with fewer model parameters) explains its superior performance, as a concise representation matters most when training data is sparse.

Figure 4. Reconstruction accuracy as a function of the fraction of training sequences used to estimate model transition parameters, while emission probabilities are estimated from all available training data. Results are averaged over 5 runs of cross-validation.
5. Conclusions and Related Work

We have introduced a model for interleaved mixtures of hidden processes, which was shown to be superior to a single-process model in an activity recognition domain. The model should be generally applicable in situations where only the interleaved output of several independent processes can be observed. Related work includes several extensions of hidden Markov models (as discussed in Section 1), and activity recognition approaches based on HMMs such as (Patterson et al., 2005) and (Zhang et al., 2007) or on dynamic Bayesian networks (Wang et al., 2007). The proposed method not only labels sequence positions but returns a structured parse of the sequence in terms of a set of hidden processes. Thus, it is also related to segmentation models, grammar-based approaches, and more generally to models for predicting structured data (see (Bakir et al., 2007) for an overview). Directions for future work include semi- and unsupervised learning settings, and testing the model in different domains and on larger activity recognition datasets.

Acknowledgments

The author would like to thank Matthai Philipose and Intel Research Seattle for making the activity recognition dataset available, and Luc De Raedt, Ingo Thon, Bernd Gutmann and Siegfried Nijssen for helpful discussions. The comments from the anonymous reviewers also helped to improve the paper. This work was supported by the Research Foundation-Flanders (FWO-Vlaanderen) and the GOA/08/008 project "Probabilistic Logic Learning".

References

Altman, R. M. (2007). Mixed hidden Markov models: An extension of the hidden Markov model to the longitudinal data setting. Journal of the American Statistical Association, 102, 201-210.

Bakir, G. H., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., & Vishwanathan, S. V. N. (2007). Predicting Structured Data. The MIT Press.

Batu, T., Guha, S., & Kannan, S. (2004). Inferring mixtures of Markov chains. Proceedings of the 17th Annual Conference on Learning Theory.

Brand, M. (1997). Coupled hidden Markov models for modeling interactive processes (Technical Report 405). MIT Media Lab.

Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32, 41-62.

Garey, M. R., & Johnson, D. S. (1975). Complexity results for multiprocessor scheduling under resource constraints. SIAM Journal on Computing, 4, 397-411.

Ghahramani, Z., & Hinton, G. E. (1998). Switching state-space models (Technical Report). Department of Computer Science, University of Toronto.

Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245-273.

Girolami, M., & Kabán, A. (2003). Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles. Proceedings of the 17th Annual Conference on Neural Information Processing Systems.

Jordan, M. I., Ghahramani, Z., & Saul, L. K. (1996). Hidden Markov decision trees. Proceedings of the 9th Conference on Advances in Neural Information Processing Systems.

Landwehr, N., Gutmann, B., Thon, I., Philipose, M., & De Raedt, L. (2007). Relational transformation-based tagging for human activity recognition. Proceedings of the International Workshop on Knowledge Discovery from Ubiquitous Data Streams.

Patterson, D., Fox, D., Kautz, H., & Philipose, M. (2005). Fine-grained activity recognition by aggregating abstract object usage. Proceedings of the 9th IEEE International Symposium on Wearable Computers.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-286.

Saul, L. K., & Jordan, M. I. (1999). Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37, 75-87.

Wang, S., Pentney, W., Popescu, A.-M., Choudhury, T., & Philipose, M. (2007). Common sense based joint training of human activity recognizers. Proceedings of the 20th International Joint Conference on Artificial Intelligence.

Zhang, W., Chen, F., Xu, W., & Cao, Z. (2007). Decomposition in hidden Markov models for activity recognition. Multimedia Content Analysis and Mining.