CT-NOR: Representing and Reasoning About Events in Continuous Time

Aleksandr Simma, Moises Goldszmidt, John MacCormick, Paul Barham, Richard Black, Rebecca Isaacs, Richard Mortier
Microsoft Research

(John is now with Dickinson College, PA; work done while a Researcher with Microsoft Research. Alex is with the University of California, Berkeley, CA; work done while an intern with Microsoft Research.)

Abstract

We present a generative model for representing and reasoning about the relationships among events in continuous time. We apply the model to the domain of networked and distributed computing environments, where we fit the parameters of the model from timestamp observations, and then use hypothesis testing to discover dependencies between the events and changes in behavior for monitoring and diagnosis. After introducing the model, we present an EM algorithm for fitting its parameters, and then present the hypothesis testing approach for both dependence discovery and change-point detection. We validate the approach for both tasks using real data from a trace of network events at Microsoft Research Cambridge. Finally, we formalize the relationship between the proposed model and the noisy-or gate for cases when time can be discretized.

1 Introduction

The research described in this paper was motivated by the following real-life application in the domain of networked distributed systems: in a modern enterprise network of scale, dependencies between hosts and network services are surprisingly complex, typically undocumented, and rarely static. Even though network management and troubleshooting rely on this information, automated discovery and monitoring of these dependencies remains an unsolved problem. In [2] we described a system called Constellation, in which computers on the network cooperate to make this information available to all users of the network. Constellation takes a black-box approach to locally (at each computer/server in the network) learning explicit dependencies between its services, using little more than the timings of packet transmission and reception. The black-box approach is necessary since any deeper processing of the incoming and outgoing communication packets would impose prohibitive overhead on the computer/server. The local models of dependency can then be recursively and distributively composed to provide a view of the global dependencies. Constellation and its application to system-wide tasks, such as characterizing a site's service and host dependencies for name resolution, web browsing, email, and printing, as well as reconfiguration planning and end-user diagnosis, are described in [2].

This paper focuses on the probabilistic and statistical building blocks of that system: the probabilistic model used in the local learning, the EM algorithm used to fit the parameters of the model, and the statistics of the hypothesis testing used to determine the local dependencies. The model, which we call Continuous Time Noisy-Or (CT-NOR), takes as input sequences of input events and output events and their time stamps. It then models the interactions between the input events and output events as Poisson processes whose intensities are modulated by a (parameterized) function taking into account the distance in time between the input and output events. Through this function the domain expert is able to explicitly encode knowledge about the domain.

The paper makes the following contributions:
1. It develops an EM algorithm for fitting all the parameters of this model, and an algorithm for dependence discovery and change-point detection based on statistical hypothesis testing.

2. It evaluates the performance of the model and the inference procedures both on synthetic data and on real-life data taken from a substantial trace of a large computer network.

3. It formalizes the relationship between CT-NOR and the noisy-or (NOR) gate [11] when the time between the events can be discretized.

This paper is organized as follows: Section 2 describes the model and Section 3 describes the EM algorithm for fitting the parameters. Section 4 is concerned with the relation to the NOR gate. The algorithms and framework for applying the model to dependency discovery and change-point detection are described in Section 5; that section also contains validation experiments with synthetic data. Section 6 contains experiments on real data and results. Finally, Section 7 offers some conclusions and future work.

2 The CT-NOR model

In this section we formally describe the CT-NOR model, with the objective of building the likelihood equation. First we provide some background on Poisson processes, and then we use them to construct the model (Eq. 4).

A Poisson process can be thought of as a random process, samples from which take the form of a set of times at which events occurred. (This overview is very informal; the more general and formal measure-theoretic definition can be found in [5].) A Poisson process is defined over a mean (base) measure $f(t)$ and is characterized by the property that for any interval $(t_1, t_2)$, the number of events that occur in that interval follows the Poisson distribution with parameter $\int_{t_1}^{t_2} f(t)\,dt$. Furthermore, the numbers of events that occur in two disjoint intervals are independent.

Let us use "channel" to denote a sequence of events. (In the domain of computer networks, a channel refers to a unidirectional flow of networked packets; thus a channel is identified by the service, e.g., HTTP or LDAP, and the IP address of the source or destination. In this paper we identify packets with events, as it is only their time stamp that matters.) The CT-NOR model considers a single output channel and a set of input channels. Let $o_l$ denote the time of the $l$th output event and $i_k^{(j)}$ the time of the $k$th input event on channel $j$. Furthermore, let $n$ denote the number of output events and $n^{(j)}$ the number of input events on channel $j$. Then event $k$ in input channel $j$ generates a Poisson process of output events with the base measure $p_k^{(j)}(t) = w^{(j)} f(t - i_k^{(j)})$. The term $w^{(j)}$ represents the average number of output events that we expect each input event on channel $j$ to be responsible for, and $f$ is the distribution of the delay between an input and the output events caused by it, taking as its argument the delay between the time of the output $o_l$ and the time of the input $i_k^{(j)}$. The mathematical structure of the intensity makes intuitive sense: the probability that a given input event caused a given output event depends on both the expected number of events it generates and the "distance" in time between them.

We recall that given multiple independent Poisson processes (denoted $PP$), we can use the sum of their intensities to construct a "global" Poisson process and write $\{o_l\} \sim PP\big(\sum_j \sum_{k=1}^{n^{(j)}} p_k^{(j)}(t)\big)$ for the probability of the set of $n$ outputs $\{o_l\}$, $1 \le l \le n$. The double sum runs over all the channels and over all input events in the channels. Intuitively, and similar to the NOR gate in graphical models [11], the independence between the input channels translates into a model where the events in the output channel are "caused" by the presence of any (a disjunction) of the input events in the input channels (with some uncertainty). The formal relation with NOR is presented in Section 4.

We now proceed to write the likelihood of the data given the model and the input events. Let $\lambda = \sum_j n^{(j)} w^{(j)}$, the total mass of the Poisson base measure. The number $n$ of outputs is distributed as a Poisson distribution

$$n \sim \mathrm{Poisson}(\lambda), \qquad (1)$$

and the location of a specific output event $o_l$ is distributed with the probability density

$$o_l \sim \frac{1}{\lambda} \sum_j \sum_{k=1}^{n^{(j)}} p_k^{(j)}(o_l) \quad \text{for } l = 1, \ldots, n \qquad (2)$$

$$\phantom{o_l \sim\ } = \frac{1}{\lambda} \sum_j \sum_{k=1}^{n^{(j)}} w^{(j)} f(o_l - i_k^{(j)}). \qquad (3)$$

The likelihood of observing a set $\{o_l\}$ of outputs is (since the Poisson process produces unordered outputs but the events are considered to be sorted, a permutation factor of $n!$ is required; it cancels out the $n!$ in the Poisson density):

$$L(o \mid i) = e^{-\lambda} \prod_{l=1}^{n} \sum_j \sum_k w^{(j)} f(o_l - i_k^{(j)}). \qquad (4)$$
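To make Eq. 4 concrete, the following minimal sketch (ours, not from the paper) evaluates the CT-NOR log-likelihood directly. The function names, data layout, and the toy truncated-exponential delay density are illustrative assumptions.

```python
import numpy as np

def ct_nor_log_likelihood(outputs, inputs, weights, delay_density):
    """Log of Eq. 4: -lambda + sum_l log sum_{j,k} w_j f(o_l - i_jk).

    outputs: 1-D array of output event times.
    inputs:  list of 1-D arrays of input event times, one per channel j.
    weights: array of per-channel weights w^(j).
    delay_density: vectorized delay density f.
    """
    lam = sum(w * len(i) for w, i in zip(weights, inputs))  # total base mass
    loglik = -lam
    for o in outputs:
        # Intensity at o: contributions from every input event on every channel.
        intensity = sum(w * delay_density(o - i).sum()
                        for w, i in zip(weights, inputs))
        loglik += np.log(intensity)
    return loglik

# Toy usage: two channels, exponential delay density, zero for negative delays.
f = lambda d: np.where(d >= 0, np.exp(-np.maximum(d, 0.0)), 0.0)
inputs = [np.array([0.1, 1.2, 3.0]), np.array([0.5, 2.5])]
outputs = np.array([0.4, 1.5, 2.9])
print(ct_nor_log_likelihood(outputs, inputs, np.array([0.8, 0.3]), f))
```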
Before concluding this section, we expand a bit on the function $f$, as it is an important part of the model. This function provides the opportunity to encode domain knowledge regarding the expected shape of the delay between input and output events. In our experience using CT-NOR to model an enterprise network we used two specific instantiations: a mixture of a narrow uniform and a decaying exponential, and a mixture of a uniform and a Gaussian. The uniform distribution captures the expert knowledge that many of the protocols involve a response within a window of time (we call this co-occurrence). The Gaussian delay distribution extends the intuitions of co-occurrence within a window to also capture dependencies that can be relatively far away in time (such as with the printer); the left tail of the Gaussian, corresponding to negative delays, is truncated. The exponential distribution captures the intuition that the possibility of dependency decays as the events grow further apart in time (this is true for the HTTP protocol). We will not explicitly expand these functions in the derivations, as they tend to obscure the exposition. Needless to say, the parameters of these functions are all fitted automatically using EM, as described in the next section.

Groups of channels may have different delay distributions, in which case the delay distribution can be indexed by the channel group and all the derivations in this paper remain the same. For example, channels can be grouped by network service, so that all HTTP channels share the same delay distribution (thus allowing data from multiple channels to assist in parameter fitting) while the DNS channels are allowed a different one.

All the experiments in the paper use a leak: a pseudo-channel with a single event at the start of the observation period and a delay distribution that is uniform over the length of the observations. The leak captures events which are not explained by the remaining channels.
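As an illustration of the delay distributions discussed above, here is a sketch of the uniform-plus-exponential mixture for $f$. The parameter names (mix, window, rate) and their values are hypothetical; in the system they would be fitted by EM.

```python
import numpy as np

def delay_mixture(d, mix=0.5, window=0.01, rate=10.0):
    """Sketch of a CT-NOR delay density in the spirit of the paper:
    a mixture of a narrow uniform on [0, window) (co-occurrence) and an
    Exponential(rate) decay. Negative delays get zero density."""
    d = np.asarray(d, dtype=float)
    uniform = np.where((d >= 0) & (d < window), 1.0 / window, 0.0)
    expo = np.where(d >= 0, rate * np.exp(-rate * np.maximum(d, 0.0)), 0.0)
    return mix * uniform + (1.0 - mix) * expo

print(delay_mixture([0.005, 0.05, -0.1]))
```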
3 Fitting a CT-NOR model

We perform inference and estimation on the model through the EM algorithm. We first set the stage by finding a suitable bounding function $B(z)$ for the likelihood; the EM algorithm iteratively chooses a tight bound in the E-step and then maximizes the bound in the M-step.

Let $z_{kl}^{(j)}$ be some positive vector such that $\sum_j \sum_k z_{kl}^{(j)} = 1$ for each $l$. For a fixed $l$, $z_{kl}^{(j)}$ is the probability of the latent state indicating that packet $k$ on channel $j$ caused output $l$. Then from Eq. 4:

$$\log L(o \mid i) = -\lambda + \sum_{l=1}^{n} \log \sum_j \sum_k w^{(j)} f(o_l - i_k^{(j)})$$

$$= -\lambda + \sum_{l=1}^{n} \log \sum_j \sum_k z_{kl}^{(j)} \frac{w^{(j)} f(o_l - i_k^{(j)})}{z_{kl}^{(j)}}$$

$$= -\lambda + \sum_{l=1}^{n} \log E_z\!\left[\frac{w^{(j)} f(o_l - i_k^{(j)})}{z_{kl}^{(j)}}\right].$$

Now, by Jensen's inequality, $\log L(o \mid i) \ge B(z)$ where

$$B(z) = -\lambda + \sum_l E_z \log \frac{w^{(j)} f(o_l - i_k^{(j)})}{z_{kl}^{(j)}}.$$

3.1 E-Step

For a particular choice of $\theta$ (the parameters of the $f$ function) and $w^{(j)}$, the bound above is tight when

$$z_{kl}^{(j)} = \frac{w^{(j)} f(o_l - i_k^{(j)})}{\sum_{j'} \sum_{k'} w^{(j')} f(o_l - i_{k'}^{(j')})}$$

because in that case $w^{(j)} f(o_l - i_k^{(j)}) / z_{kl}^{(j)}$ is a constant for a fixed $l$, and $E \log C = \log EC = \log C$. Therefore we use this choice of $z_{kl}^{(j)}$.

3.2 M-step

For a fixed choice of $z_{kl}^{(j)}$, we need to maximize the bound with respect to $w^{(j)}$ and $\theta$. Optimizing with respect to $w^{(j)}$, we notice that the derivative is

$$\frac{\partial B}{\partial w^{(j)}} = -n^{(j)} + \frac{1}{w^{(j)}} \sum_l \sum_k z_{kl}^{(j)},$$

yielding

$$\hat{w}^{(j)} = \frac{\sum_l \sum_k z_{kl}^{(j)}}{n^{(j)}}.$$

With respect to $\theta$, we can say that

$$\hat{\theta} = \arg\max_\theta \sum_{j,k,l} z_{kl}^{(j)} \log f_\theta(o_l - i_k^{(j)}),$$

which is simply the part of the objective function that depends on $\theta$. This can be a very easy optimization problem for a large class of distributions, as it is of the same form as maximum likelihood parameter estimation given observed data points and corresponding counts. For example, for the exponential family this simply requires moment matching:

$$\mu(\hat{\theta}) = \frac{\sum_{j,k,l} z_{kl}^{(j)}\, T(o_l - i_k^{(j)})}{\sum_{j,k,l} z_{kl}^{(j)}},$$

where $\mu(\hat{\theta})$ is the mean parameterization of the estimated parameter and $T(\cdot)$ are the sufficient statistics for the family.
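A minimal sketch (ours) of one EM sweep for the channel weights, holding the delay distribution's parameters fixed. It implements the E-step responsibilities and the M-step update $\hat{w}^{(j)} = \sum_{l,k} z_{kl}^{(j)} / n^{(j)}$ derived above; helper names are assumptions.

```python
import numpy as np

def em_step_weights(outputs, inputs, weights, delay_density):
    """One EM sweep for the channel weights, f's parameters held fixed.
    E-step: z_kl^(j) proportional to w_j f(o_l - i_jk), normalized over (j, k).
    M-step: w_j <- (sum over l, k of z_kl^(j)) / n^(j)."""
    resp = np.zeros(len(weights))           # per-channel responsibility totals
    for o in outputs:
        # Unnormalized responsibility mass of each channel for output o,
        # already summed over that channel's input events k.
        contrib = np.array([w * delay_density(o - i).sum()
                            for w, i in zip(weights, inputs)])
        total = contrib.sum()
        if total > 0:
            resp += contrib / total         # E-step normalization over (j, k)
    return np.array([r / len(i) for r, i in zip(resp, inputs)])  # M-step

# Toy usage: iterate to (approximate) convergence with a fixed f.
f = lambda d: np.where(d >= 0, np.exp(-np.maximum(d, 0.0)), 0.0)
inputs = [np.array([0.1, 1.2, 3.0]), np.array([0.5, 2.5])]
outputs = np.array([0.4, 1.5, 2.9])
w = np.array([0.5, 0.5])
for _ in range(20):
    w = em_step_weights(outputs, inputs, w, f)
print(w)
```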
4 Relation to Noisy-Or

As an alternative model, consider binning the observed data into windows of width $\Delta$ and modeling the presence or absence of output events in a particular bin as a NOR [11]. The possible explanations (parents) are the presence of input events in preceding windows. We will show that a particular, natural parameterization of the NOR model is equivalent to CT-NOR in the limit, as the bin width approaches zero. This relationship is important because it provides a nontrivial extension of NOR to domains with continuous time and provides insight into the independence structure of the two models.

Let $O_t$ be an indicator of the presence of output events between the times $t$ and $t + \Delta$, and let $I_t^{(j)}$ be the indicator for input events from channel $j$ in that same time period. We will use $P_{NOR}$ to denote probability under the NOR model and $P_{CT\text{-}NOR}$ for probability under CT-NOR. We parameterize the NOR model in terms of the delay distribution $f$; this parameterization imposes only minor constraints on the weights, but is useful for reasoning about NOR models which model the same data with differing bin widths. When the bin width is halved, the probability that one of the sub-bins has an output event must equal the probability that the large bin has an output event, up to a second-order term. This condition is required for a coherent parameterization of a family of NOR distributions, and it follows from the technical conditions placed on $f$.

We argue that as the bin width decreases, this model becomes equivalent to a CT-NOR with a suitable choice of parameters. Choose $\Delta$ sufficiently small that each bin contains at most one input event per channel and at most one output event. We will use $P_{NOR}^t$ to denote $P_{NOR}(O_t = 0 \mid \text{Input})$, the probability that the $t$th bin has no output events falling into it:

$$P_{NOR}^{t} = \prod_{j}\prod_{s \le t}\bigl(1 - \theta_{t-s}^{(j)}\bigr)^{I_{s}^{(j)}}, \qquad \theta_{d}^{(j)} = 1 - e^{-w^{(j)}\int_{d\Delta}^{(d+1)\Delta} f(u)\,du} \ \text{for } d \ge 0, \ \text{and } 0 \ \text{otherwise},$$

so that an input in bin $s$ of channel $j$ independently fails to produce an output in bin $t$ with probability $1 - \theta_{t-s}^{(j)}$.

5.2 Changepoint Detection

Detecting changepoints is accomplished by testing two hypotheses. The null hypothesis is that the weights do not change between two time periods, and can be written as $w_t^{(j)} = w^{(j)}$. Under the alternative, for a particular channel of interest $m$ and an interval of time $S$, the weight $w^{(m)}$ is allowed to take a different value inside $S$ than outside it.

Figure 2 shows a quantile-quantile plot of the p-values (computed using the $\chi^2$ distribution) under the null hypothesis, computed for causal channels of the same synthetic data as in Section 5.1; there are two hours of data with 500 input events per channel per hour. As expected, the quantile-quantile plot forms a straight line, demonstrating that on the synthetic dataset the null test statistic has a $\chi^2$ distribution. When a strong changepoint is present ($w^{(j)}$ changes from 0.01 to 0.05), the p-values are very low. When a weak changepoint is present ($w^{(j)}$ changes from 0.01 to 0.02), the p-values are lower than under the null distribution, but the power is significantly lower than when detecting the major changepoint.
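A sketch of the changepoint test above, assuming the two maximized log-likelihoods have already been obtained by refitting with EM under each hypothesis. The use of one degree of freedom for a single channel of interest is the standard likelihood-ratio asymptotics; the paper's exact calibration details are not reproduced here.

```python
from scipy.stats import chi2

def changepoint_pvalue(loglik_null, loglik_alt, df=1):
    """Likelihood-ratio changepoint test in the style of Section 5.2.
    loglik_null: max log-likelihood with weights shared across both periods.
    loglik_alt:  max log-likelihood when channel m's weight may differ in S.
    Under the null, -2 log(lambda) is asymptotically chi-squared with df
    equal to the number of extra free parameters (1 for a single channel)."""
    stat = -2.0 * (loglik_null - loglik_alt)
    return chi2.sf(stat, df)

# Toy numbers: the alternative fits noticeably better -> small p-value.
print(changepoint_pvalue(loglik_null=-1000.0, loglik_alt=-996.0))
```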
5.3 Bounding the log-likelihood ratio

Computing the log-likelihood ratio requires refitting a restricted model, though only a small number of EM steps is typically required. However, it is possible to bound the log-likelihood ratio for dependency discovery very efficiently. For the restricted model testing channel $m$'s causality, we must compute the likelihood under the constraint that $w^{(m)} = 0$. Take the estimates of $w$ from the unrestricted model and let $\tilde{\lambda} = \lambda - w^{(m)} n^{(m)}$. Instead of computing the ratio with the true maximum likelihood parameters for the restricted model, we propose a set of restricted parameters and compute the ratio using them. We produce a restricted version of the parameters $w^{(\cdot)}$ by setting $w^{(m)}$ to zero and inflating the rest by a factor of $\alpha = \lambda / \tilde{\lambda}$. This simply corresponds to imposing the restriction and redistributing the weight among the rest of the parameters, so that the expected number of output packets remains the same. In that case,

$$-2 \log \Lambda = -2 \log \frac{L_{M_{res}}(\text{Data})}{L_{M_{full}}(\text{Data})}$$

$$\le -2 \sum_l \log \frac{\alpha \sum_{j \ne m} \sum_k w^{(j)} f(o_l - i_k^{(j)})}{\sum_j \sum_k w^{(j)} f(o_l - i_k^{(j)})}$$

$$= -2 \sum_l \log \alpha \left(1 - \frac{\sum_k w^{(m)} f(o_l - i_k^{(m)})}{\sum_j \sum_k w^{(j)} f(o_l - i_k^{(j)})}\right)$$

$$= -2 \sum_l \log \alpha \left(1 - \sum_k z_{kl}^{(m)}\right).$$

As a reminder, $z_{kl}^{(m)}$ is the latent variable distribution estimated in the E-step of EM. Since the numerator of the likelihood ratio is evaluated at a lower bound (the proposed restricted parameters rather than the maximum likelihood ones) and the denominator is exact, this expression is a lower bound on $\Lambda$. Intuitively, $\sum_l \log(1 - \sum_k z_{kl}^{(m)})$ corresponds to the log of the probability that channel $m$ has exactly zero output events assigned to it when causality is assigned according to the EM distribution on the latent variables. The $\log \alpha$ term corresponds to the increase in likelihood from redistributing channel $m$'s weight among the other channels.
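A sketch of the bound computation, assuming the per-output responsibilities $\sum_k z_{kl}^{(m)}$ from the unrestricted EM fit are available. The explicit form of the inflation factor $\alpha = \lambda / (\lambda - w^{(m)} n^{(m)})$ is our reading of the weight-redistribution argument above.

```python
import numpy as np

def dependence_bound_stat(z_m, alpha):
    """Cheap test statistic from Section 5.3, computed without refitting:
    -2 * sum_l log(alpha * (1 - sum_k z_kl^(m))).
    z_m:   array over outputs l of channel m's total responsibility
           sum_k z_kl^(m) under the unrestricted EM fit.
    alpha: inflation factor redistributing channel m's weight (assumed
           form lambda / (lambda - w_m * n_m))."""
    return -2.0 * np.sum(np.log(alpha * (1.0 - z_m)))

# Toy usage: channel m explains little of any output -> small statistic.
z_m = np.array([0.01, 0.0, 0.02])
lam, w_m, n_m = 5.0, 0.1, 3
alpha = lam / (lam - w_m * n_m)
print(dependence_bound_stat(z_m, alpha))
```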
6 Results

We describe the results of applying the algorithms of the previous sections to a subset of a real dataset consisting of a trace comprising headers and partial payload of around 13 billion packets collected over a 3.5-week period in 2005 at Microsoft Research in Cambridge, England. This site contains about 500 networked machines, and the trace captures conversations with over 2800 off-site IP addresses. Ground truth for dependence discovery and changepoint detection is not readily available and has to be manually generated. We took 24 hours of data at the web proxy and manually extracted ground truth for the HTTP traffic at this server by deep inspection of the HTTP packets. It is with this part of the data that we validate our algorithms, as it provides us with objective metrics, such as precision and recall, to assess their performance.

6.1 Dependency Discovery

First, we are interested in assessing the performance of the dependence discovery capabilities of our model and hypothesis testing algorithm. In the application of diagnosis and monitoring of networked systems, it is crucial to maintain a consistent map of all the server and service inter-dependencies and their changes. Finding dependencies at the server level is the main building block used by Constellation [2] in building this global map.

We compare our method to two alternatives. The first is a simple binomial test: for each input channel, we count the number of output packets falling within a window of width $W$ around an input packet, and determine whether that number is significantly higher than if the output packets were uniformly distributed. We call this "standard co-occurrence." The second alternative considers an input and output channel to be dependent only if there is a unique input packet in the immediate vicinity of an output packet ("unambiguous co-occurrence"). The reasons we selected these two alternatives are that (a) they reflect, by and large, current heuristics used in the systems community [1], and (b) they capture essentially the "easy" dependencies, as our results indicate. (As sometimes an input packet generates more than one output packet, we enabled our model to account for this by allowing "autocorrelations": a packet in an output channel can depend on an input channel or on the time-wise preceding output packet.)

As can be seen in the ROC curve in Figure 3, CT-NOR successfully captures 85% of the true correlations at a 1% false positive rate. In total, the model detects 95% of the true correlations at 10% false positives. We additionally point out that some of the correlations present are very subtle: 13% of the correlations are evidenced by a single output packet. We also point out that CT-NOR performs significantly better than both alternatives based on co-occurrence of input packets, providing even more conclusive evidence that CT-NOR is capturing nontrivial dependencies.

Figure 3: ROC for HTTP (true positive rate vs. FDR) for CT-NOR with exact and bounded test statistics, and for the standard and unambiguous co-occurrence baselines, on data from a real enterprise network. Both the exact and approximate CT-NOR tests produce detection results superior to the alternative methods.

The approximation error from using the bound of Section 5.3 is minimal, while the computational savings are significant. On a relatively slow laptop, the bound on the log-likelihood ratio test for an hour of traffic on a busy HTTP proxy can be computed in 7 seconds; the exact computation takes 86 seconds.
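For concreteness, here is a sketch of the "standard co-occurrence" binomial baseline described in Section 6.1. Computing the uniform null probability as the coverage of the union of $\pm W$ windows is our assumption about how the baseline is operationalized.

```python
import numpy as np
from scipy.stats import binom

def cooccurrence_pvalue(outputs, inputs, W, T):
    """Binomial co-occurrence test: count output packets within W of an
    input packet and test whether the count exceeds what uniformly placed
    outputs on [0, T] would produce."""
    outputs = np.asarray(outputs, float)
    inputs = np.sort(np.asarray(inputs, float))
    # Null success probability: fraction of [0, T] covered by the windows,
    # merging overlaps with a sweep over the sorted window endpoints.
    covered, cursor = 0.0, 0.0
    for s, e in zip(np.maximum(inputs - W, 0.0), np.minimum(inputs + W, T)):
        s = max(s, cursor)
        if e > s:
            covered, cursor = covered + (e - s), e
    p0 = covered / T
    # Observed count: outputs whose nearest input lies within W.
    hits = int(np.sum(np.min(np.abs(outputs[:, None] - inputs[None, :]),
                             axis=1) <= W))
    return binom.sf(hits - 1, len(outputs), p0)   # P(X >= hits) under null

print(cooccurrence_pvalue([0.4, 1.5, 2.9], [0.1, 1.2, 3.0], W=0.5, T=4.0))
```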
6.2 Changepoint Detection

Since the true presence or absence of a changepoint is unknown, we estimate it from the actual packet causes, obtained through deep inspection of the HTTP packets. We collect a set of input and output channel pairs for which there is no evidence of change; we regard these as coming from the null hypothesis. A set of pairs for which the ground truth provides strong evidence of a change is also collected, and considered to come from the alternative hypothesis. We apply our changepoint test to this population and report the results in Figure 4. The CT-NOR changepoint detection algorithm produces uniformly distributed p-values for channels which come from the null hypothesis and do not exhibit a changepoint, confirming that our null hypothesis distribution is calibrated. On the other hand, the test on alternative hypothesis channels produces a large proportion of very small p-values, indicating confidence that a changepoint occurred.

Figure 4: Quantile-quantile plot of changepoint p-values against uniform quantiles. The red circles are channel pairs which, according to the ground truth, do not exhibit a changepoint. The blue triangles represent channel pairs exhibiting change according to the ground truth.

7 Conclusions and Future Work

We presented CT-NOR, a generative model based on Poisson processes, to model the relationship between events based on the times of their occurrences. The model is induced from data containing only the time stamps of the events. This capability is crucial in the domain of networked systems, as collecting any other type of information would entail prohibitive overhead. Specific domain knowledge about the expected shape of the distribution of the time delay between events can be incorporated into the model using a parameterized function. The EM algorithm used to fit the parameters of the model given the data also induces the parameters of this function. The combination of knowledge engineering and learning from data is clearly exemplified in the application we presented to the domain of computer systems, where we used a mixture model consisting of an exponential and a uniform distribution.

In terms of applying the model, we focused on providing building blocks for diagnosis and monitoring. We provided algorithms based on statistical hypothesis testing for (a) discovering the dependencies between input and output channels in computer networks, and (b) finding changes in expected behavior (change-point detection). We validated these algorithms first on synthetic data, and then on a subset (HTTP traffic) of a trace of real data from events in a corporate communication network containing 500 computers and servers.

The relationship presented in Section 4 between CT-NOR and the NOR gate is interesting for multiple reasons. First, as the NOR gate has been extensively studied in this community for modeling, learning, and causal discovery [4], the immediate benefits are (a) extending its applicability to continuous time, and (b) augmenting its modeling capabilities with the time delay functions used in this work. Second, this correspondence provides us with additional intuition about the independence assumptions behind the Poisson process, as applied to characterizing the relationship between the events in the various inputs and the events in a specific output.

For the particular application of dependency discovery between channels in a computer network, we explored a varied set of alternative approaches. They all failed miserably. Among these, we briefly discuss two. First, we cast the problem as one of classification and tried a host of Bayesian network classifiers [6]. The idea was to first discretize time into suitable periods, and then take as features the presence or absence of events in the input channels and as the class the presence or absence of events in the output channel. The accuracy was abysmal. The main problem with this approach is that the communication in these networks is bursty by nature, with relatively large periods of quiet time. Once we started to look at Poisson distributions as the appropriate way to quantify the distributions in these classifiers, the choice of the Poisson process became clear. Second, we explored hypothesis testing comparing the inter-event times between the input and output channels to the inter-event times between the input and a fictitious random channel. The accuracy in terms of false positives and true positives was worse than that of the co-occurrence approaches. The main problem here is that we are considering pairwise interactions, and there are many confounders in all the other channels.

With regard to related approaches, both the work on continuous time Bayesian networks [10] and the work on dynamic Bayesian networks in general (e.g., [9]) are very different in terms of the parameterization of the models, the assumptions, and the intended application. The work closest to ours is the paper by Rajaram et al. [12], which proposes a (graphical) model for point processes in terms of Poisson networks. The main difference between their work and ours is the tradeoff between representational capabilities and complexity of inference that the different foci of our respective papers entail. Due to the distributed nature of our application domain, we concentrate on modeling the "families" (local parent/child relationships) and basically assume that we can reconstruct, in a distributed manner based on the local information, the topology of the network. This enables us to induce families with large numbers of parents, and with relatively complex interactions as given by the delay function f, while performing inference efficiently. In the Poisson networks paper [12], the number of parents of each node is restricted, and the rate function is parameterized by a generalized linear model. Even with these (relatively benign) restrictions, inference is non-trivial in terms of finding the structure of the Bayesian network, and indeed this is a contribution of that paper. Obviously, future work includes merging both approaches: an immediate benefit would be to decrease the vulnerability of our approach to spurious causal dependencies due to ignoring the global structure in the estimation.

There are three other threads that we are currently investigating for future work. The first involves recasting the fitting and inference procedures described in this paper in the Bayesian framework. An advantage of the Bayesian approach would be the inclusion of priors; as channels differ greatly in the number of events, this can further increase the accuracy of discovery. A second direction is incorporating False Discovery Rate [3] calculations in order to accurately estimate false positives when we do not have ground truth regarding the relationships between the channels. As we are performing a large number of hypothesis tests, this becomes a necessity. In [2] we experimented with the basic approach described in [3], and we verified that the approach is very conservative in the context of the HTTP and DNS protocols, where we do have ground truth.
We plan to explore less conservative approaches, such as the one described in [13], or to adapt the one explored in [8]. Finally, we are in the process of obtaining suitable data and plan to apply this model to biological networks, such as neurons that communicate with other neurons using spikes in electrical potential.

8 Acknowledgments

We thank T. Graepel for comments on a previous version of this paper. We are also grateful for the helpful suggestions of the anonymous reviewers, which we hope we have addressed to their satisfaction.

References

[1] V. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. Maltz, and M. Zhang. Towards highly reliable enterprise network services via inference of multi-level dependencies. In SIGCOMM'07, Aug. 2007.

[2] P. Barham, R. Black, M. Goldszmidt, R. Isaacs, J. MacCormick, R. Mortier, and A. Simma. Constellation: automated discovery of service and host dependencies in networked systems. Technical Report MSR-TR-2008-67, Microsoft Research, 2008.

[3] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(2-3):125–133, 1995.

[4] J. Breese and D. Heckerman. Causal independence for probability assessment and inference in Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, 1996.

[5] R. Durrett. Probability: Theory and Examples. Duxbury Press, second edition, 1996.

[6] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131–163, 1997.

[7] B. Lindsay. Mixture Models: Theory, Geometry, and Applications. Institute of Mathematical Statistics, Hayward, 1995.

[8] J. Listgarten and D. Heckerman. Determining the number of non-spurious arcs in a learned DAG model: Investigation of a Bayesian and a frequentist approach. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2007.

[9] K. Murphy. Dynamic Bayesian networks: Representation, inference, and learning. PhD thesis, UC Berkeley, 2001.

[10] U. Nodelman, C. Shelton, and D. Koller. Learning continuous time Bayesian networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2003.

[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[12] S. Rajaram, T. Graepel, and R. Herbrich. Poisson-networks: A model for structured point processes. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, 2005.

[13] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.

[14] L. Wasserman. All of Statistics. Springer, 2004.