On-line Discovery of Temporal-Difference Networks

Takaki Makino (mak@scint.dpc.u-tokyo.ac.jp)
Division of Project Coordination, University of Tokyo, 5-1-5 Kashiwa-no-ha, Kashiwa-shi, Chiba 277-8568 Japan

Toshihisa Takagi (tt@k.u-tokyo.ac.jp)
Database Center for Life Science, Research Organization of Information and Systems, Tokyo 113-0022 Japan

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We present an algorithm for on-line, incremental discovery of temporal-difference (TD) networks. The key contribution is the establishment of three criteria for expanding a node in a TD network: a node is expanded when it is well-known, independent, and has a prediction error that requires further explanation. Since none of these criteria requires centralized calculation, they are easily computed in a parallel and distributed manner, and they scale to larger problems better than other discovery methods for predictive state representations. Through computer experiments, we demonstrate the empirical effectiveness of our algorithm.

1. Introduction

Predictive representations (Littman et al., 2002; Jaeger, 2000) are a relatively new group of approaches for expressing and learning grounded knowledge about dynamical systems. These approaches represent the state of a dynamical system as a vector of predictions, based on the hypothesis that important knowledge about the world can be represented strictly in terms of relationships between predictions of observable quantities.

In the predictive state representations (PSRs) introduced by Littman et al. (2002), each prediction is an estimate of the probability of a test, defined as some sequence of observations given a sequence of actions. Sutton and Tanner (2005) proposed another approach to predictive representations, namely temporal-difference (TD) networks. TD networks were developed as a generalization of PSRs: in a TD network, each prediction is an estimate of the probability or expected value of some function of future predictions, actions, and observations. The predictions can be considered as "answers" to a set of "questions" represented in the TD network.
One important problem in the research on predictive representations is the discovery problem, that is, the problem of determining the set of questions (or core tests) so that the state of a dynamical system is correctly represented by the vector of predictions for these questions. Many of the existing studies on this problem (Rosencrantz et al., 2004; James & Singh, 2004; Wolfe et al., 2005) rely on off-line discovery and learning of PSRs, and are therefore hardly applicable either to implementing live-interaction agents or to finding corresponding activity in the human brain. There is an on-line discovery algorithm for PSR core tests (McCracken & Bowling, 2006), but it requires complex operations on a large matrix, such as calculation of a maximal linearly independent set, and is therefore not suitable for parallel and distributed computing. As far as we are aware, no algorithm has been proposed for discovering the questions of a TD network. Studying a TD-network discovery algorithm that is suitable for parallel distributed computing would contribute not only to research on predictive representations but also to cognitive science, by providing a hypothesis about the algorithm actually used in the human brain.

In this study, we propose an algorithm for discovering the correct set of tests by incremental node expansion in a TD network. Our key contribution is in the criteria that we have developed for node expansion: a node is expanded when the node is well-known, independent, and has a prediction error that requires further explanation. To check these criteria, our algorithm maintains the average squared error and average variance for each node, and it introduces a dependency detection network. Since none of these criteria requires centralized operations such as calculation of linear independence over the whole representation matrix, they are easily computed in a parallel and distributed manner, which is an important property when seeking the algorithm used in the human brain. Although the algorithm has no theoretical guarantee of finding the correct question network (indeed, it is known to fail in some cases), our simulation experiments demonstrate its empirical effectiveness.

Section 2 reviews PSRs and TD networks to the extent necessary to understand our algorithm. Section 3 describes our incremental discovery algorithm. Experiments and results are shown in Section 4. After that, we discuss related work and future directions in Sections 5 and 6.

2. TD Networks

In this section, we give a brief review of TD networks, based on the work by Tanner and Sutton (2005a). The purpose of a TD network is to learn predictions of future observations obtained from the environment. Consider a partially observable environment that changes its state according to an agent's action $a_t \in A$ at every time step $t$, while the agent only receives a partial (and possibly noisy) observation $o_{t+1} \in O$ of the state. In general $o_t$ can be a vector of $l$ bits, but in this paper we consider the case where the observation is a single bit ($l = 1$).

A TD network consists of a set of nodes and links, and each node represents a single scalar prediction. A node has one or more links directed to other nodes or to the observation from the environment, which denote the targets of the node's prediction. A link may have a condition, which indicates that the node is a conditional prediction of the target. This set of nodes and links is called the question network, since each node poses some question about the environment. As in previous studies (Tanner & Sutton, 2005b; Tanner & Sutton, 2005a), we focus on a subset of TD networks in which every node has a single target (hereafter, the parent node) and every link is conditioned on an action. Figure 1 is an example of such a TD network. The node $y^1$ predicts the observation at the next step if action $a_1$ is taken. The node $y^4$ predicts the value of node $y^1$ at the next step if action $a_2$ is taken, and so on.

Figure 1. Example of the TD networks we focus on in this paper.

To provide answers to the questions asked by the question network, each node in a TD network also works as a function approximator. The inputs to the function approximator of a node are defined by the answer network, taking values from other nodes, available observations, and the actions to be taken. These function approximators are trained so that the outputs of the nodes become the answers to the questions asked by the question network. However, to provide accurate answers, the set of nodes has to be a sufficient representation of the environmental state; in other words, a correct set of questions has to be posed by the question network. The focus of this paper is the discovery of the question network, i.e., finding a structure of the question network that is sufficient for prediction.

Formally, we denote the prediction for node $i$ at time step $t$ as $y^i_t \in [0, 1]$, $i = 1, \ldots, n$. The prediction vector $y_t = (y^1_t, \ldots, y^n_t)^T$ is given by the answer network:

$$ y_t = \sigma(W_t x_t), \qquad (1) $$

where $x_t \in \mathbb{R}^m$ is a feature vector, $W_t$ is an $n \times m$ matrix of modifiable weights, and $\sigma$ is the S-shaped logistic function $\sigma(s) = (1 + e^{-s})^{-1}$. The feature vector is a function of the preceding action, observation, and node values:

$$ x_t = x(a_{t-1}, o_t, y_{t-1}) \in \mathbb{R}^m. \qquad (2) $$

We used a feature vector of a form similar to that in the work of Tanner and Sutton (2005a); in our experiments, where two actions (L and R) are possible,

$$ x(a, o, y) = \begin{cases} (o,\ 1-o,\ y^1, \ldots, y^n,\ 0, \ldots, 0)^T & a = L \\ (0, \ldots, 0,\ o,\ 1-o,\ y^1, \ldots, y^n)^T & a = R \end{cases} \qquad (3) $$

This is equivalent to a separate $W$ for each action.
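The answer-network computation of eqs. (1)–(3) is straightforward to implement. The following is a minimal sketch (not taken from the paper); the array layout and helper names are our own assumptions, chosen to match eq. (3) for the two-action, single-observation-bit case.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def feature_vector(a, o, y_prev):
    """Feature vector x(a, o, y) of eq. (3) for two actions, L and R.
    The block for the untaken action is zero, which is equivalent to
    keeping a separate weight matrix W for each action."""
    block = np.concatenate(([o, 1.0 - o], y_prev))
    zeros = np.zeros_like(block)
    return np.concatenate((block, zeros) if a == 'L' else (zeros, block))

def answer_network(W, a_prev, o, y_prev):
    """Prediction vector y_t = sigma(W_t x_t) of eq. (1)."""
    x = feature_vector(a_prev, o, y_prev)
    return sigmoid(W @ x), x
```

With this layout, $m = 2(n + 2)$ when $l = 1$, and $W$ has shape $n \times m$.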
The question network, which gives the targets of the predictions in terms of the node values at the next time step, is represented by an $n \times (n + l)$ matrix $Z$ and a vector $c$. Without eligibility traces, the vector of target values is

$$ z_{t-1} = c_t \odot Z \begin{pmatrix} o_t \\ y_t \end{pmatrix} + \bar{c}_t \odot y_{t-1}, \qquad (4) $$

where $\odot$ is element-by-element multiplication, and the elements of $Z$, $c$, and $\bar{c}$ are

$$ z^{ij} = \begin{cases} 1 & \text{if the $j$-th component of $(o_t, y_t)$ is the parent node of } y^i \\ 0 & \text{otherwise,} \end{cases} \qquad (5) $$

$$ c^i_t = \begin{cases} 1 & \text{if $a_t$ satisfies node $y^i$'s condition} \\ 0 & \text{otherwise,} \end{cases} \qquad (6) $$

$$ \bar{c}^i_t = 1 - c^i_t. \qquad (7) $$

The elements of $Z$ are assigned so that $z^i_{t-1} = y^{p(i)}_t$ for any $i$ with $c^i_t = 1$, where $p(i)$ is the parent node of $i$. On the other hand, if $c^i_t = 0$, then $z^i_{t-1} = y^i_{t-1}$ and the TD error of the node becomes zero, so that only the weights of the nodes that satisfy their condition are updated.

In this study we employ eligibility traces (Tanner & Sutton, 2005a), a technique that accelerates TD learning by incorporating further predictions into the learning target. In a forward view,

$$ z_{t-1} = c_t \odot Z \begin{pmatrix} o_t \\ \lambda z_t + (1 - \lambda) y_t \end{pmatrix} + \bar{c}_t \odot y_{t-1}, \qquad (8) $$

where $\lambda \in [0, 1]$ is a parameter that controls the weight of temporally distant results in the learning target. When $\lambda = 0$, eq. (8) is equivalent to eq. (4), and no eligibility traces are used. However, this formula recursively contains future values of $z$, so it is not easy to calculate on-line. Tanner and Sutton (2005a) propose an algorithm that performs on-line updates of the weight vector equivalent to (8). Each component $w^{ij}_t$ of $W_t$ is updated by the learning rule

$$ w^{ij}_{t+1} = w^{ij}_t + \alpha (z^i_t - y^i_t) \frac{\partial y^i_t}{\partial w^{ij}_t} = w^{ij}_t + \alpha (z^i_t - y^i_t)\, y^i_t (1 - y^i_t)\, x^j_t, \qquad (9) $$

in which the second form is derived from eq. (1).

Roughly, the operation of a TD network proceeds by repeating the following steps: (1) choose an action $a_{t-1}$ and receive an observation $o_t$ from the environment; (2) operate the answer network, i.e., calculate the feature vector $x_t = x(a_{t-1}, o_t, y_{t-1})$ and obtain the new predictions $y_t = \sigma(W_t x_t)$; (3) use the question network to obtain the target values for the previous predictions, $z_{t-1} = z(y_t, o_t)$, and update the weights $W$ according to the TD error $z_{t-1} - y_{t-1}$. For details such as the precise order of calculation, readers should consult the original TD-network paper (Sutton & Tanner, 2005).
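As a concrete illustration of steps (1)–(3), the sketch below performs one time step without eligibility traces ($\lambda = 0$), reusing the `answer_network` helper from the previous sketch. The arrays `parent` and `cond_action` are our own encoding of the question network (the parent index of each node, with $-1$ standing for the observation bit, and the action that conditions its link), and `alpha` is the learning rate; none of these names come from the paper.

```python
def td_network_step(W, parent, cond_action, a_prev, o, y_prev, x_prev, alpha):
    """One step of TD-network operation without eligibility traces (eqs. (4), (9))."""
    # Step (2): operate the answer network on a_{t-1}, o_t, y_{t-1}.
    y, x = answer_network(W, a_prev, o, y_prev)

    # Step (3): question-network targets z_{t-1} for the *previous* predictions.
    # A node's target is its parent's current value if its action condition was
    # satisfied by a_{t-1}; otherwise the target equals the old prediction, so
    # that the TD error (and hence the update) for that node is zero.
    z = y_prev.copy()
    for i in range(len(y_prev)):
        if cond_action[i] == a_prev:
            z[i] = o if parent[i] < 0 else y[parent[i]]

    # TD update of the answer-network weights, eq. (9); x_prev produced y_prev.
    delta = (z - y_prev) * y_prev * (1.0 - y_prev)
    W += alpha * np.outer(delta, x_prev)
    return y, x    # carried to the next call as y_prev and x_prev
```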
3. On-line Discovery Algorithm

We propose an algorithm that performs on-line discovery of the question network. Our algorithm starts with the minimal network, which consists of the observation node and a set of prediction nodes for the observation node, one for each action. During learning, the algorithm grows the network by expanding leaf nodes, that is, by adding a set of prediction nodes for the expanded node. Intuitively, a node is expanded when the following three criteria hold:

1. The node is well-known: The agent has sufficient experience to learn the node. This criterion prevents relatively unexplored nodes from being expanded.

2. The node is independent: The prediction of the node cannot be calculated in terms of other, previously known node values. In terms of PSRs, the node represents a core test. This criterion avoids redundant expansion of nodes.

3. The node's error requires further explanation: The node's prediction error is not smaller than expected from the prediction error of the node's parent node. This criterion chooses the node that has the best descriptive power for the error in the parent node, and stops expansion once the unpredictability is resolved.

In the following, we first describe the variables that our algorithm maintains to check these criteria, and then present the detailed conditions for the criteria.

3.1. Variables

3.1.1. Dependency Detection Network

Our algorithm uses a dependency detection network, which tries to represent the prediction of node $y^i$ in terms of the observation $o$ and the values of the nodes with smaller indices, $y^j$ ($j < i$). If the network succeeds in representing $y^i$ with small error, we conclude that $y^i$ is dependent on the predictions with smaller indices (namely, not a core test of a PSR) and exclude it from the candidates for node expansion. Otherwise, we can assume that $y^i$ is an independent node. Note that this corresponds to a core test in non-linear PSRs (Rudary & Singh, 2004), because the nodes in TD networks correspond to e-tests in non-linear PSRs (Sutton & Tanner, 2005) and we use a sigmoidal function in the answer network.

Formally, the dependency detection network is represented by $D_t$, an $n \times (n + l)$ matrix of modifiable weights, in which $d^{ij}$ is restricted to zero for $i \le j - l$, so that node $i$ may only be described by the observation and by nodes with smaller indices. The output of the network is

$$ d_t = \sigma\left(D_t \begin{pmatrix} o_t \\ y_t \end{pmatrix}\right). $$

The network is trained so that $d_t$ is close to $y_t$; in other words, the network tries to represent the prediction of node $y^i$ in terms of the observation $o$ and the predictions of the nodes with smaller indices $y^j$ ($j < i$). Since the indices of the nodes are numbered in order, newly expanded nodes are always given higher indices, and are not used by the dependency detection network for describing older nodes with lower indices.

Each component $d^{ij}_t$ of $D_t$ is updated by a learning rule similar to eq. (9):

$$ d^{ij}_{t+1} = d^{ij}_t + \alpha^D (y^i_t - d^i_t) \frac{\partial d^i_t}{\partial d^{ij}_t} \qquad (10) $$
$$ \qquad\;\; = d^{ij}_t + \alpha^D (y^i_t - d^i_t)\, d^i_t (1 - d^i_t)\, y^j_t, \qquad (11) $$

where the update is applied only in the unrestricted area ($i > j - l$). We set the learning rate $\alpha^D$ of the dependency detection network larger than that of the answer network, so that the dependency detection network can track changes in the calculated node values during the learning of the answer network (as long as the values remain dependent).
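Continuing the earlier sketch, the masked update of eqs. (10)–(11) might be implemented as follows; the explicit `mask` matrix is our own device for encoding the structural restriction described above.

```python
def make_dependency_mask(n, l=1):
    """mask[i, j] = 1 where d^{ij} may be non-zero: the l observation inputs
    and the nodes y^j with j < i (zero elsewhere, including the diagonal)."""
    mask = np.zeros((n, n + l))
    mask[:, :l] = 1.0                     # every node may use the observation
    for i in range(n):
        mask[i, l:l + i] = 1.0            # ...and the nodes with smaller index
    return mask

def dependency_step(D, mask, o, y, alpha_D=0.2):
    """One update of the dependency detection network, eqs. (10)-(11)."""
    inputs = np.concatenate(([o], y))     # (o_t, y_t)
    d = sigmoid((D * mask) @ inputs)      # d_t = sigma(D_t (o_t, y_t))
    grad = (y - d) * d * (1.0 - d)        # delta rule with the sigmoid derivative
    D += alpha_D * np.outer(grad, inputs) * mask   # update only unmasked weights
    return d
```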
3.1.2. Average Errors

Our algorithm gathers statistical information about the predictions and errors of the TD network and the dependency detection network. To allow on-line learning, the statistical variables are calculated as exponential moving averages:

$$ y^{\mathrm{LERR}}_{t+1} = \gamma_2\, y^{\mathrm{LERR}}_t + (1 - \gamma_2)\,(z_t - y_t) \odot (z_t - y_t) \qquad (12) $$
$$ d^{\mathrm{SERR}}_{t+1} = \gamma_1\, d^{\mathrm{SERR}}_t + (1 - \gamma_1)\,(y_t - d_t) \odot (y_t - d_t) \qquad (13) $$
$$ d^{\mathrm{LERR}}_{t+1} = \gamma_2\, d^{\mathrm{LERR}}_t + (1 - \gamma_2)\,(y_t - d_t) \odot (y_t - d_t) \qquad (14) $$

where $d^{\mathrm{SERR}}$ is a short-term average of squared prediction errors, $d^{\mathrm{LERR}}$ and $y^{\mathrm{LERR}}$ are long-term averages of squared prediction errors, and $0 < \gamma_1 < \gamma_2 < 1$ are temporal factors. When a node $y^i$ is added to the network, its statistical variables are initialized as $y^{\mathrm{LERR}, i} = d^{\mathrm{SERR}, i} = d^{\mathrm{LERR}, i} = 1.0$, so that the variables indicate large errors during the initial period of the node.

Without eligibility traces, the target variable $z_t$ is available at time $t + 1$, so these statistics are easily calculated on-line. However, since eq. (12) contains a quadratic term in $z_t$, the on-line calculation technique with eligibility traces used in Tanner and Sutton's work (2005a) cannot be applied directly. Our implementation keeps a record of the last $k$ steps of node values, where $k$ is the maximum depth of the current question network, and calculates the errors of $y_t$ and $d_t$ at time $t + k$.
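These statistics are simple to maintain on-line. A minimal sketch, continuing the code above; the class and field names are ours, and the default decay factors are the values used in Section 4.

```python
import numpy as np

class NodeStats:
    """Exponential moving averages of squared errors (eqs. (12)-(14)),
    initialized to the largest possible value so that newly added nodes
    report large errors during their initial learning period."""
    def __init__(self, n):
        self.y_lerr = np.ones(n)   # long-term squared TD error of the answer network
        self.d_serr = np.ones(n)   # short-term squared error of the dependency network
        self.d_lerr = np.ones(n)   # long-term squared error of the dependency network

    def update(self, y, z, d, gamma1=0.99, gamma2=0.999):
        td_sq = (z - y) ** 2
        dep_sq = (y - d) ** 2
        self.y_lerr = gamma2 * self.y_lerr + (1 - gamma2) * td_sq
        self.d_serr = gamma1 * self.d_serr + (1 - gamma1) * dep_sq
        self.d_lerr = gamma2 * self.d_lerr + (1 - gamma2) * dep_sq
```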
3.2. Expansion Criteria

Using these variables, we check the criteria described at the beginning of Section 3 as follows. The criteria are checked at every time step. Since all criteria are defined on exponential moving averages, the precise timing of the criteria check and node expansion is not important; the check can be inserted anywhere in the TD($\lambda$) network learning algorithm (Tanner & Sutton, 2005a). The expansion criteria are designed to avoid redundant expansion as much as possible, because no shrinking criterion is provided.

3.2.1. The node is well-known

To avoid expanding relatively unexplored nodes, we use the following criteria to determine whether node $y^i$ is well-known:

· The learning error is not in a decreasing trend (we assume the error is always decreasing during the initial learning phase).
· The learning error of the node has become small compared with that of its parent node.

In formal representation,

$$ d^{\mathrm{SERR}, i}_t \ge d^{\mathrm{LERR}, i}_t \quad\text{and} \qquad (15) $$
$$ d^{\mathrm{LERR}, i}_t \le d^{\mathrm{LERR}, p(i)}_t. \qquad (16) $$

If $p(i)$ is the observation bit, then $d^{\mathrm{LERR}, p(i)}_t$ is taken to be 1 (the largest possible value). Eq. (15) works because these variables, being moving averages of errors, are initialized with the highest possible value (see Section 3.1.2); thus the short-term average error is expected to stay below the long-term average until the end of the initial learning period, in which the learning error is constantly decreasing. Eq. (16) is usually satisfied after a sufficient amount of experience, because the dependency network has no connection from a child node to a parent node. The parent node always has fewer inputs in the dependency network than the child node; if the child node has an unexplained dependency (a large $d^{\mathrm{LERR}, i}_t$), it is likely that the parent node also has an unexplained dependency.

3.2.2. The node is independent

To prevent dependent (non-core-test) nodes from being expanded, we require that the learning error of the dependency detection network is not small:

$$ d^{\mathrm{LERR}, i}_t \ge \theta_1, \qquad (17) $$

where $\theta_1$ is a threshold parameter that controls the requirement for independence. When all the nodes that match this criterion have been expanded, all the leaf nodes in the question network are dependent nodes, and no further expansion occurs. Thus, this criterion is equivalent to the assumption that an independent (core-test) node does not exist as a child of a dependent (non-core-test) node.

3.2.3. The node's error requires further explanation

If the prediction error of a node is larger than expected from its parent node, it is reasonable to require further prediction on the node in order to reduce the error. Otherwise, we can infer that the error in the parent node has other causes (e.g., another prediction node with a different action condition has a large prediction error), and further prediction for the node is less important for reducing the error of the final prediction of the observation. In the case that the parent node $p(i)$ is purely probabilistic, the error distribution over $p(i)$'s child nodes is proportional to the probability that the conditions of the nodes are matched; the following condition checks that the error of the node is greater than that:

$$ y^{\mathrm{LERR}, i}_t \ge \frac{\#y^i}{\#y^{p(i)}}\, y^{\mathrm{LERR}, p(i)}_t + \theta_2, \qquad (18) $$

where $\#y^i$ is the frequency with which the conditions on the chain from the observation bit to node $y^i$ are matched (thus, $\#y^i / \#y^{p(i)}$ is the relative probability that the condition of the node is matched). If $p(i)$ is the observation bit, then $y^{\mathrm{LERR}, p(i)}_t$ is taken to be zero. $\theta_2$ is a threshold parameter that controls the tolerance for noise.
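Putting eqs. (15)–(18) together, a purely node-local check can be written as follows (a sketch under the same assumptions as the earlier code; `count`, holding the visit frequencies $\#y^i$, is our own name, and the thresholds default to the values used in Section 4).

```python
def should_expand(i, parent, stats, count, theta1=0.0045, theta2=0.0005):
    """Check the three expansion criteria for leaf node i (eqs. (15)-(18)).
    parent[i] < 0 means the parent of i is the observation bit; count[i] is the
    frequency with which the condition chain of node i has been matched."""
    p = parent[i]

    # Criterion 1: well-known -- error no longer decreasing (15) and not
    # larger than the parent's (16); the observation bit counts as error 1.
    parent_d_lerr = 1.0 if p < 0 else stats.d_lerr[p]
    well_known = (stats.d_serr[i] >= stats.d_lerr[i]
                  and stats.d_lerr[i] <= parent_d_lerr)

    # Criterion 2: independent -- the dependency network cannot explain it (17).
    independent = stats.d_lerr[i] >= theta1

    # Criterion 3: requires further explanation -- TD error exceeds the share
    # expected from the parent's error (18); zero if the parent is the observation.
    expected = 0.0 if p < 0 else (count[i] / max(count[p], 1)) * stats.y_lerr[p]
    needs_explanation = stats.y_lerr[i] >= expected + theta2

    return well_known and independent and needs_explanation
```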
4. Experiments

To test the effectiveness of our algorithm, we conducted a series of computer experiments on the n-state ring world (n = 5, 8) (Tanner & Sutton, 2005b). Figure 2 illustrates the 8-state ring world. There are two observations, 0 or 1, and two actions, L and R. Since the ring worlds contain only deterministic state transitions and observations, we also tested our algorithm on some standard probabilistic POMDP environments taken from a repository (Cassandra, 1999). Among them, we used the environments with a one-bit observation, adapted to the non-reward setting (we followed a previous work that describes the details; McCracken, 2005).

Figure 2. 8-state ring world. The digit in a node represents the observation from that state, and edge labels denote actions.

We generated a sequence of experience by a uniform random policy and applied our algorithm on-line. Throughout the experiments we used $\gamma_1 = 0.99$, $\gamma_2 = 0.999$, $\alpha^D = 0.2$, $\lambda = 0.9$, $\theta_1 = 0.0045$, and $\theta_2 = 0.0005$. The learning rate $\alpha$ is initialized to 0.1, and after 800,000 steps it is halved every 100,000 steps. We measured the error of the prediction for the selected action against the oracle (the observation probability calculated from the structure of the environment), and plot the mean squared error for every 10,000 steps.
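For concreteness, the learning-rate schedule just described can be written as a small helper (our own code; the text does not specify whether the first halving occurs exactly at step 800,000, so this is one possible reading).

```python
def learning_rate(t, alpha0=0.1, start=800_000, interval=100_000):
    """alpha is held at alpha0 for the first 800,000 steps and then
    halved for every further 100,000 steps (one possible reading)."""
    if t < start:
        return alpha0
    return alpha0 * 0.5 ** (1 + (t - start) // interval)
```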
Figures 3(a) and 3(b) show the results for the 5- and 8-state ring worlds. We can see that the number of nodes in the TD network increases, as a result of node expansion, until the prediction error decreases. This indicates that nodes in the TD network are expanded only when it is required.

Figure 3. Results of experiments on (a) the 5-state ring world, (b) the 8-state ring world, (c) Float-Reset, (d) Paint, (e) Tiger, and (f) Network. Each panel plots the mean squared error (with expansion, without expansion, with the maximum number of nodes, and with half the learning rate) together with the number of nodes; the x-axis shows time steps (×10^5).

We made additional tests with various values of the parameters $\theta_1$ and $\theta_2$ on the 8-state ring world (Table 1). We found that $\theta_2$ affects the final number of nodes. With larger $\theta_2$, the algorithm failed to make a required node expansion; with smaller $\theta_2$, the algorithm made some spurious node expansions (though the learning was still successful). On the other hand, $\theta_1$ mainly affects the learning time and is less related to the number of nodes.

Table 1. Tests for various $\theta_1$ and $\theta_2$ values in the 8-state ring world. Values are [final number of nodes] / [steps (×10^4) until learned (MSE reduced below 10^-4)].

θ1 \ θ2    0.001    0.002    0.003    0.004    0.005
0.005      18/47    18/49    18/53    18/54    18/59
0.0075     20/46    18/48    18/55    18/57    18/59
0.01       16/49    18/51    16/55    16/58    16/60
0.0125     6/–      6/–      6/–      6/–      6/–
0.015      6/–      6/–      6/–      6/–      6/–

Figures 3(c) to 3(f) show the results of our algorithm on the other well-known POMDP environments. Our algorithm successfully learned predictions in all cases. However, we found that the initial form of the TD network, without node expansion, can learn equally well (compared to the case started from a large TD network, mechanically generated by expanding all nodes), due to the high generalization capacity of a TD network with the sigmoid function. Thus these experiments are not helpful for evaluating our discovery algorithm. We do see, however, that some node expansions occurred in these experiments. This indicates that our algorithm sometimes makes spurious node expansions, especially in these probabilistic environments.

We also evaluated the selected criteria in terms of the discovered question network for the 8-state ring world. Figure 4 compares the mean squared errors and the number of nodes in the discovered network under various settings of the criteria (in these experiments $\alpha$ is kept constant at 0.1). Although the prediction error approaches zero under all conditions, an increase in the number of nodes is observed if any one of the criteria is removed.

Figure 4. Results of experiments (average of 30 trials) on the 8-state ring world with individual criteria disabled: (a) mean squared error, (b) number of nodes. The compared settings are: all criteria; "well-knownness 1" (eq. 15) disabled; "well-knownness 2" (eq. 16) disabled; "independence" disabled; and "requiring explanation" disabled.

Figure 5 further examines the differences between the discovered TD networks. In the TD network discovered using all of the criteria (Figure 5(a)), all the expanded nodes are independent of each other; thus the discovered network is the best network for the 8-state ring world that can be discovered by our algorithm. This stands in clear contrast to the TD network discovered solely by the dependency detection network, shown in Figure 5(b). Although this network has correctly learned to predict the 8-state ring world, some spurious node expansions are observed. This indicates that the dependency detection network alone is not powerful enough to choose the right node to expand, and shows the importance of the other criteria in our algorithm.

Figure 5. Examples of discovered TD networks for the 8-state ring world: (a) with all criteria; (b) without the criterion "requires further explanation". The number in a node denotes the time at which the node was expanded (×10^3).

5. Discussion

The algorithm we have presented has some limitations. They seem unavoidable, because the algorithm has to make inferences using an incomplete question network. In this section, we discuss some of the limitations that arose in our approach.

5.1. Deep Dependency

In our algorithm, Criterion 3 (requires further explanation) is devoted to distinguishing apparently unpredictable events, caused by an incomplete question network, from inherently random events. However, we can imagine cases in which this criterion does not work well, especially when the observation depends only on the actions taken several steps before. As an example, consider an n-step delay world: $A = O = \{0, 1\}$, and the agent observes its own action with an n-step delay. If the agent chooses actions randomly, then the observation sequence is also a random string. Moreover, unless the depth of the question network reaches n, the agent can predict nothing. For $n \ge 2$, the nodes in the question network are likely to fail to satisfy Criterion 3, and as a result the algorithm fails to acquire the correct test set for the environment. A partial solution may be to introduce active learning (planned exploration): in the example above, if the agent adopted some biased choice of actions for a while, it would observe informative results that might satisfy Criterion 3.
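The n-step delay world is easy to state programmatically; the following small sketch (our own construction, for illustration only) makes the failure mode concrete: under a uniform random policy the observations are themselves a uniform random sequence, so no question network of depth less than n can reduce the prediction error.

```python
import random
from collections import deque

class DelayWorld:
    """n-step delay world: the agent observes its own action n steps later.
    Observations during the first n steps are 0."""
    def __init__(self, n):
        self.buffer = deque([0] * n, maxlen=n)

    def step(self, action):           # action is 0 or 1
        obs = self.buffer[0]          # the action taken n steps ago comes out...
        self.buffer.append(action)    # ...and the new action is stored
        return obs

world = DelayWorld(n=3)
for t in range(10):
    a = random.randint(0, 1)
    o = world.step(a)                 # o equals the action chosen 3 steps earlier
```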
However, we can imagine cases in which this criterion does not work well, especially when the observation depends only on the actions several steps before. As an example, suppose an n-step delay world: A = On-line Discovery of Temp oral-Difference Networks erate the question network and shrink it afterwards. 5.4. Handling Wider Observation In this paper, we considered only the simple case with one observation bit (l = 1). When l 2, it would be necessary to consider sets of nodes that share a common action conditions and associated to different observations, and to expand all nodes in a set simultaneously. In addition, it is necessary to extend our criteria have to handle the node sets. References Bowling, M., McCracken, P., James, M., Neufeld, J., & Wilkinson, D. (2006). Learning predictive state representations using non-blind policies. In Proc. of ICML '06 (pp. 129­136). ACM Press. Cassandra, A. (1999). Tony's POMDP file repository page. URL http://www.cs.brown.edu/research/ai/ pomdp/examples/index.html. Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371­1398. James, M. R., & Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. Proc. of ICML'04 (p. 53). ACM Press. Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In Advances in neural information processing systems 14, 1555­1561. MIT Press. McCracken, P. (2005). An online algorithm for discovery and learning of predictive state representations. Master's thesis, University of Alberta. McCracken, P., & Bowling, M. (2006). Online discovery and learning of predictive state representations. In Advances in neural information processing systems 18, 875­882. MIT Press. Rosencrantz, M., Gordon, G., & Thrun, S. (2004). Learning low dimensional predictive representations. Proc. of ICML '04 (p. 88). ACM Press. Rudary, M. R., & Singh, S. (2004). A nonlinear predictive state representation. In Advances in neural information processing systems 16. MIT Press. Sutton, R. S., & Tanner, B. (2005). Temporaldifference networks. In Advances in neural information processing systems 17, 1377­1384. MIT Press. Tanner, B., & Sutton, R. S. (2005a). TD() networks: temporal-difference networks with eligibility traces. Proc. of ICML'05 (pp. 888­895). ACM Press. Tanner, B., & Sutton, R. S. (2005b). Temporaldifference networks with history. In Proc. of IJCAI'05, 865­870. Wolfe, B., James, M. R., & Singh, S. (2005). Learning predictive state representations in dynamical systems without reset. Proc. of ICML'05 (pp. 980­987). ACM Press. 6. Related Work McCracken and Bowling (2006) proposes the on-line discovery and learning algorithm for predictive state representation. However, their algorithm has to keep substantially long (in their example, 1000 steps) history in memory and calculate linear independency, and requires quite complex efforts to normalize the probability in their representation. On the other hand, our algorithm requires only a small amount of memory (up to the size required for eligibility traces) and, thanks to the sigmoidal function used in the TD network, no effort is required to normalize the result. We argue that these differences cause larger difference in calculation costs in a complex environment. 7. Summary We presented an algorithm for on-line, incremental discovery of temporal-difference (TD) networks. 
The key contribution is the establishment of criteria for expanding a leaf node in a TD network: the algorithm expands a node when the node is well-known, independent, and has a prediction error that requires further explanation. Since none of these criteria requires centralized calculation, they can be computed in a parallel and distributed manner. Through computer experiments on n-state ring worlds, we demonstrated the empirical effectiveness of our algorithm.

Among the directions for future work, the most important is to evaluate our algorithm on a wider variety of environments for comparison with other discovery algorithms. Agent planning with TD networks should also be studied, so that it can be combined with our algorithm to develop an agent that actively explores its environment (Bowling et al., 2006).

Acknowledgments

I'd like to thank Prof. Steven Kraines for his kind help. This research is partially supported by the MEXT Grant-in-Aid for Young Scientists (B) (20700126).

References

Bowling, M., McCracken, P., James, M., Neufeld, J., & Wilkinson, D. (2006). Learning predictive state representations using non-blind policies. Proc. of ICML '06 (pp. 129–136). ACM Press.

Cassandra, A. (1999). Tony's POMDP file repository page. URL http://www.cs.brown.edu/research/ai/pomdp/examples/index.html.

Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371–1398.

James, M. R., & Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. Proc. of ICML '04 (p. 53). ACM Press.

Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. Advances in Neural Information Processing Systems 14 (pp. 1555–1561). MIT Press.

McCracken, P. (2005). An online algorithm for discovery and learning of predictive state representations. Master's thesis, University of Alberta.

McCracken, P., & Bowling, M. (2006). Online discovery and learning of predictive state representations. Advances in Neural Information Processing Systems 18 (pp. 875–882). MIT Press.

Rosencrantz, M., Gordon, G., & Thrun, S. (2004). Learning low dimensional predictive representations. Proc. of ICML '04 (p. 88). ACM Press.

Rudary, M. R., & Singh, S. (2004). A nonlinear predictive state representation. Advances in Neural Information Processing Systems 16. MIT Press.

Sutton, R. S., & Tanner, B. (2005). Temporal-difference networks. Advances in Neural Information Processing Systems 17 (pp. 1377–1384). MIT Press.

Tanner, B., & Sutton, R. S. (2005a). TD(λ) networks: temporal-difference networks with eligibility traces. Proc. of ICML '05 (pp. 888–895). ACM Press.

Tanner, B., & Sutton, R. S. (2005b). Temporal-difference networks with history. Proc. of IJCAI '05 (pp. 865–870).

Wolfe, B., James, M. R., & Singh, S. (2005). Learning predictive state representations in dynamical systems without reset. Proc. of ICML '05 (pp. 980–987). ACM Press.