Selecting Observations against Adversarial Objectives

Andreas Krause (SCS, CMU); H. Brendan McMahan (Google, Inc.); Carlos Guestrin (SCS, CMU); Anupam Gupta (SCS, CMU)

Abstract

In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms.

1 Introduction

In tasks such as sensor placement for environmental temperature monitoring or experimental design, one has to select among a large set of possible, but expensive, observations. Often, there are several different objective functions which we want to optimize simultaneously. For example, in the environmental monitoring problem, we want to minimize the marginal posterior variance of our temperature estimate at all locations simultaneously. In experimental design, we often have uncertainty about the model parameters, and we want our experiments to be informative no matter what the true parameters of the model are. These problems can be interpreted as a game: We select a set of observations (sensor locations, experiments), and an adversary selects an objective function (a location at which to evaluate the predictive variance, a setting of the model parameters, etc.) to test us on. Often, the individual objective functions (e.g., the marginal variance at one location, or the information gain for a fixed set of parameters [1, 2]) satisfy submodularity, an intuitive diminishing returns property: Adding a new observation helps less if we have already made many observations, and more if we have made few observations thus far. While NP-hard, the problem of selecting an optimal set of k observations maximizing a single submodular objective can be approximately solved using a simple greedy forward-selection algorithm, which is guaranteed to perform near-optimally [3]. However, as we show, this simple myopic algorithm performs arbitrarily badly in the case of an adversarially chosen objective. In this paper, we address this problem. In particular: (1) We present Saturate, an efficient algorithm for settings where an adversarially-chosen submodular objective function must be optimized. Our algorithm guarantees solutions which are at least as informative as the optimal solution, at only a slightly higher cost. (2) We prove that our approximation guarantee is best possible and cannot be improved unless NP-complete problems admit efficient algorithms.
(3) We extensively evaluate our algorithm on several real-world tasks, including minimizing the maximum posterior variance in Gaussian Process regression, finding experimental designs which are robust with respect to parameter uncertainty, and sensor placement for outbreak detection.

2 The adversarial observation selection problem

Observation selection with a single submodular objective. Observation selection problems can often be modeled using set functions: We have a finite set $V$ of observations to choose from, and a utility function $F$ which assigns a real number $F(A)$ to each $A \subseteq V$, quantifying its informativeness. In many settings, such as the ones described above, the utility $F$ exhibits the property of submodularity: adding an observation helps more the fewer observations have been made so far [2]. Formally, $F$ is submodular [3] if, for all $A \subseteq B \subseteq V$ and $s \in V \setminus B$, it holds that $F(A \cup \{s\}) - F(A) \geq F(B \cup \{s\}) - F(B)$; $F$ is monotonic if for all $A \subseteq B \subseteq V$ it holds that $F(A) \leq F(B)$, and $F$ is normalized if $F(\emptyset) = 0$. Hence, many observation selection problems can be formalized as

  $\max_{A \subseteq V} F(A)$, subject to $|A| \leq k$,   (2.1)

where $F$ is normalized, monotonic and submodular, and $k$ is a bound on the number of observations we can make. Since solving problem (2.1) is generally NP-hard [4], in practice heuristics are often used. One such heuristic is the greedy algorithm. This algorithm starts with the empty set, and iteratively adds the element $s^* = \mathrm{argmax}_{s \in V \setminus A} F(A \cup \{s\})$, until $k$ elements have been selected. Perhaps surprisingly, a fundamental result by Nemhauser et al. [3] states that for submodular functions, the greedy algorithm achieves a constant factor approximation: The set $A_G$ obtained by the greedy algorithm achieves at least a constant fraction $(1 - 1/e)$ of the objective value obtained by the optimal solution, i.e., $F(A_G) \geq (1 - 1/e) \max_{|A| \leq k} F(A)$. Moreover, no polynomial time algorithm can provide a better approximation guarantee unless P = NP [4].
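As a concrete reference point, the greedy forward-selection heuristic for problem (2.1) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper; the callable F is assumed to be a user-supplied normalized, monotonic, submodular set function operating on Python sets.

```python
def greedy_maximize(V, F, k):
    """Greedy forward selection for problem (2.1).

    V: iterable of candidate observations; F: set -> float, assumed normalized,
    monotonic and submodular; k: cardinality budget.  By Nemhauser et al. [3],
    the returned set achieves at least a (1 - 1/e) fraction of the optimal value.
    """
    A = set()
    for _ in range(k):
        best_s, best_gain = None, float("-inf")
        for s in set(V) - A:
            gain = F(A | {s}) - F(A)   # marginal gain of adding s
            if gain > best_gain:
                best_s, best_gain = s, gain
        if best_s is None:             # fewer than k candidates available
            break
        A.add(best_s)
    return A
```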
Observation selection with adversarial objectives. In many applications (such as those discussed below), one wants to simultaneously optimize multiple objectives. Here, we are given a collection of monotonic submodular functions $F_1, \ldots, F_m$, and we want to solve

  $\max_{A \subseteq V} \min_i F_i(A)$, subject to $|A| \leq k$.   (2.2)

Problem (2.2) can be considered a game: First, we (the max-player) select a set of observations $A$, and then our opponent (the min-player) selects a criterion $F_i$ to test us on. Our goal is to select a set $A$ of observations which performs well against an opponent who chooses the worst possible $F_i$ knowing our choice $A$. Thereby, we try to find a pure equilibrium of a sequential game on a matrix, with one row per $A$ and one column per $F_i$. Note that even if the $F_i$ are all submodular, $G(A) = \min_i F_i(A)$ is not submodular. In fact, we show below that, in this setting, the simple greedy algorithm (which performs near-optimally in the single-criterion setting) can perform arbitrarily badly.

Examples of adversarial observation selection problems. We consider three instances of adversarial selection problems. Sec. 4 provides more details and experimental results for these domains.

Minimizing the maximum Kriging variance. Consider a Gaussian Process (GP) [5] $X_V$ defined over a finite set of locations (indices) $V$. Here, $X_V$ is a set of random variables, one variable $X_s$ for each location $s \in V$. Given a set of locations $A \subseteq V$ which we observe, we can compute the predictive distribution $P(X_{V \setminus A} \mid X_A = x_A)$, i.e., the distribution of the variables $X_{V \setminus A}$ at the unobserved locations $V \setminus A$, conditioned on the measurements at the selected locations, $X_A = x_A$. Let $\sigma^2_{s|A}$ be the residual variance after making observations at $A$. Let $\Sigma_{AA}$ be the covariance matrix of the measurements at the chosen locations $A$, and $\Sigma_{sA}$ be the vector of cross-covariances between the measurements at $s$ and $A$. Then, the variance $\sigma^2_{s|A} = \sigma^2_s - \Sigma_{sA} \Sigma_{AA}^{-1} \Sigma_{As}$ depends only on the set $A$, and not on the observed values $x_A$. Assume that the a priori variance $\sigma^2_s$ is constant for all locations $s$ (in Sec. 3, we show our approach generalizes to non-constant marginal variances). We want to select locations $A$ such that the maximum marginal variance is as small as possible. Equivalently, we can define the variance reduction $F_s(A) = \sigma^2_s - \sigma^2_{s|A}$, and desire that the minimum variance reduction over all locations $s$ is as large as possible. Das and Kempe [1] show that, in many practical cases, the variance reduction $F_s$ is a monotonic submodular function.
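To illustrate the variance-reduction objective, here is a small numpy sketch of $F_s(A) = \sigma^2_s - \sigma^2_{s|A}$ computed from a prior covariance matrix; this is an assumption-laden illustration (dense covariance, noise-free observations), not the paper's implementation.

```python
import numpy as np

def variance_reduction(Sigma, A, s):
    """F_s(A) = sigma_s^2 - sigma_{s|A}^2 for a GP with prior covariance Sigma.

    Sigma: (n x n) covariance matrix over all locations; A: collection of
    observed location indices; s: target location index.  Implements
    sigma_{s|A}^2 = sigma_s^2 - Sigma_{sA} Sigma_{AA}^{-1} Sigma_{As}.
    """
    A = list(A)
    if not A:
        return 0.0
    Sigma_AA = Sigma[np.ix_(A, A)]
    Sigma_sA = Sigma[s, A]
    posterior_var = Sigma[s, s] - Sigma_sA @ np.linalg.solve(Sigma_AA, Sigma_sA)
    return Sigma[s, s] - posterior_var
```

One such objective $F_s$ per location $s$ is exactly the collection $F_1, \ldots, F_m$ fed into problem (2.2) for the minimax Kriging application.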
Robust experimental designs. Another application is experimental design under nonlinear dynamics [6]. The goal is to estimate a set of parameters $\theta$ of a nonlinear function $y = f(x, \theta) + w$, by providing a set of experimental stimuli $x$, and measuring the (noisy) response $y$. In many cases, experimental design for linear models (where $y = A(x)^T \theta + w$) with Gaussian noise $w$ can be solved efficiently [7]. In the nonlinear case, the common approach is to linearize $f$ around an initial parameter estimate $\theta_0$, i.e., $y = f(x, \theta_0) + V(x)(\theta - \theta_0) + w$, where $V(x)$ is the Jacobian of $f$ with respect to the parameters $\theta$, evaluated at $\theta_0$. In [6], it was shown that the efficiency of the design can be very sensitive to the initial parameter estimate $\theta_0$. Consequently, they develop an efficient semidefinite program (SDP) for E-optimal design (i.e., the goal is to minimize the maximum eigenvalue of the error covariance) which is robust against perturbations of the Jacobian $V$. However, it might be more natural to directly consider robustness with respect to perturbations of the initial parameter estimate $\theta_0$, around which the linearization is performed. We show how to find (Bayesian A-optimal) designs which are robust against uncertainty in these parameter estimates. In this setting, the objectives $F_{\theta_0}(A)$ are the reductions of the trace of the parameter covariance, $F_{\theta_0}(A) = \mathrm{tr}(\Sigma^{(\theta_0)}) - \mathrm{tr}(\Sigma^{(\theta_0)}_{|A})$, where $\Sigma^{(\theta_0)}$ is the joint covariance of observations and parameters after linearization around $\theta_0$; thus, $F_{\theta_0}$ is a sum of marginal parameter variance reductions, which are individually monotonic and submodular [1], and so $F_{\theta_0}$ is monotonic and submodular as well. Hence, in order to find a robust design, we maximize the minimum variance reduction, where the minimum is taken over (a discretization into a finite subset of) all initial parameter values $\theta_0$.

Sensor placement for outbreak detection. Another class of examples are outbreak detection problems on graphs, such as contamination detection in water distribution networks [8]. Here, we are given a graph $G = (V, E)$, and a phenomenon spreading dynamically over the graph. We define a set of intrusion scenarios $I$; each scenario $i \in I$ models an outbreak (e.g., spreading of contamination) starting from a given node $s \in V$ in the network. By placing sensors at a set of locations $A \subseteq V$, we can detect such an outbreak, and obtain a utility $F_i(A)$ (e.g., the reduction in detection time or in the population affected). In [8], it was shown that these utilities $F_i$ are monotonic and submodular for a large class of utility functions. In the adversarial setting, the adversary observes our sensor placement $A$, and then decides on an intrusion $i$ for which our utility $F_i(A)$ is as small as possible. Hence, our goal is to find a placement $A$ which performs well against such an adversarial opponent.

Hardness of the adversarial observation selection problem. Given the near-optimal performance of the greedy algorithm for the single-objective problem, a natural question is whether the performance guarantee generalizes to the more complex adversarial setting. Unfortunately, this is far from true. Consider the case with two submodular functions, $F_1$ and $F_2$, where the set of observations is $V = \{s_1, s_2, t_1, t_2\}$. We set $F_1(\emptyset) = F_2(\emptyset) = 0$, and define $F_1(A) = 1$ if $s_1 \in A$, and otherwise $\varepsilon$ times the number of $t_i$ contained in $A$. Similarly, if $s_2 \in A$, we set $F_2(A) = 1$, and otherwise $\varepsilon$ times the number of $t_i$ contained in $A$. Both $F_1$ and $F_2$ are submodular and monotonic. Optimizing for a set of 2 elements, the greedy algorithm maximizing $G(A) = \min\{F_1(A), F_2(A)\}$ would choose the set $\{t_1, t_2\}$, since this choice increases $G$ by $2\varepsilon$, whereas adding an $s_i$ would not increase the score. However, the optimal solution with $k = 2$ is $\{s_1, s_2\}$, with a score of 1. Hence, as $\varepsilon \to 0$, the greedy algorithm performs arbitrarily worse than the optimal solution. Our next hope would be to obtain a different good approximation algorithm. However, we can show that most likely this is not possible:

Theorem 1. Unless P = NP, there cannot exist any polynomial time approximation algorithm for Problem (2.2). More precisely: Let $n$ be the size of the problem instance, and $\gamma(\cdot) > 0$ be any positive function of $n$. If there exists a polynomial-time algorithm which is guaranteed to find a set $A'$ of size $k$ such that $\min_i F_i(A') \geq \gamma(n) \max_{|A| \leq k} \min_i F_i(A)$, then P = NP.

Thus, unless P = NP, there cannot exist any algorithm which is guaranteed to provide, e.g., even an exponentially small fraction ($\gamma(n) = 2^{-n}$) of the optimal solution. All proofs can be found in [9].
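The two-function counter-example above is easy to reproduce numerically; the following toy script (illustrative only, with $\varepsilon = 10^{-3}$) shows the greedy algorithm on $G = \min\{F_1, F_2\}$ ending up with score $2\varepsilon$ while the optimum is 1.

```python
EPS = 1e-3  # the epsilon of the counter-example

def F1(A):
    return 1.0 if "s1" in A else EPS * sum(1 for x in A if x.startswith("t"))

def F2(A):
    return 1.0 if "s2" in A else EPS * sum(1 for x in A if x.startswith("t"))

def G(A):
    return min(F1(A), F2(A))

V = {"s1", "s2", "t1", "t2"}
A = set()
for _ in range(2):                       # greedy on the non-submodular G
    A.add(max(V - A, key=lambda s: G(A | {s}) - G(A)))

print(G(A))                  # ~2*EPS: greedy picks {t1, t2}
print(G({"s1", "s2"}))       # 1.0: the optimal solution of size 2
```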
3 The Submodular Saturation Algorithm

Since Theorem 1 rules out any approximation algorithm which respects the constraint $k$ on the size of the set $A$, our only hope for non-trivial guarantees requires us to relax this constraint. We now present an algorithm that finds a set of observations which performs at least as well as the optimal set, but at slightly increased cost; moreover, we show that no efficient algorithm can provide better guarantees (under reasonable complexity-theoretic assumptions). For now we assume all $F_i$ take only integral values; this assumption is relaxed later. The key idea is to consider the following alternative formulation:

  $\max_{c, A} c$, subject to $c \leq F_i(A)$ for $1 \leq i \leq m$ and $|A| \leq \alpha k$.   (3.1)

We want a set $A$ of size at most $\alpha k$, such that $F_i(A) \geq c$ for all $i$, and $c$ is as large as possible. Here $\alpha \geq 1$ is a parameter relaxing the constraint on $|A|$: if $\alpha = 1$, we recover the original problem (2.2). We solve program (3.1) as follows: For each value $c$, we find the cheapest set $A$ with $F_i(A) \geq c$ for all $i$. If this cheapest set has at most $k$ elements, then $c$ is feasible for the original problem (2.2). A binary search on $c$ allows us to find the optimal solution with the maximum feasible $c$. We first show how to approximately solve this covering problem for a fixed $c$.

For $c > 0$ define $F_{i,c}(A) = \min\{F_i(A), c\}$, the original function $F_i$ truncated at score level $c$; these $F_{i,c}$ are also submodular [10]. Let $\bar{F}_c(A) = \frac{1}{m} \sum_i F_{i,c}(A)$ be their average value; submodular functions are closed under convex combinations, so $\bar{F}_c$ is submodular and monotonic. Furthermore, $F_i(A) \geq c$ for all $1 \leq i \leq m$ if and only if $\bar{F}_c(A) = c$. Hence, in order to determine whether some $c$ is feasible, we solve a submodular covering problem:

  $A^*_c = \mathrm{argmin}_{A \subseteq V} |A|$, such that $\bar{F}_c(A) = c$.   (3.2)

Such problems are NP-hard in general [4], but in [11] it is shown that the greedy algorithm (c.f. Algorithm 1) achieves near-optimal performance on this problem.

  GPC($\bar{F}_c$, $c$):
    $A \leftarrow \emptyset$;
    while $\bar{F}_c(A) < c$ do
      foreach $s \in V \setminus A$ do $\delta_s \leftarrow \bar{F}_c(A \cup \{s\}) - \bar{F}_c(A)$;
      $A \leftarrow A \cup \{\mathrm{argmax}_s \delta_s\}$;
  Algorithm 1: The greedy submodular partial cover (GPC) algorithm.

  Saturate($F_1, \ldots, F_m$, $k$, $\alpha$):
    $c_{\min} \leftarrow 0$; $c_{\max} \leftarrow \min_i F_i(V)$; $A_{\mathrm{best}} \leftarrow \emptyset$;
    while $c_{\max} - c_{\min} \geq \frac{1}{m}$ do
      $c \leftarrow (c_{\min} + c_{\max})/2$;
      define $\bar{F}_c(A) \leftarrow \frac{1}{m} \sum_i \min\{F_i(A), c\}$;
      $\hat{A} \leftarrow$ GPC($\bar{F}_c$, $c$);
      if $|\hat{A}| > \alpha k$ then $c_{\max} \leftarrow c$;
      else $c_{\min} \leftarrow c$; $A_{\mathrm{best}} \leftarrow \hat{A}$;
  Algorithm 2: The Submodular Saturation algorithm.

Using this result, we find:

Lemma 2. Given monotonic submodular functions $F_1, \ldots, F_m$ and a (feasible) constant $c$, Algorithm 1 (with input $\bar{F}_c$) finds a set $A_G$ such that $F_i(A_G) \geq c$ for all $i$, and $|A_G| \leq \alpha |A^*|$, where $A^*$ is the optimal solution and $\alpha = 1 + \log\left(\max_{s \in V} \sum_i F_i(\{s\})\right)$.

We can compute this approximation guarantee for any given instance of the adversarial observation selection problem. Hence, if for a given value of $c$ the greedy algorithm returns a set of size greater than $\alpha k$, there cannot exist a solution $A'$ with $|A'| \leq k$ and $F_i(A') \geq c$ for all $i$; thus, the optimal value of the adversarial observation selection problem must be less than $c$. We can use this argument to conduct a binary search to find the optimal value of $c$. We call Algorithm 2, which formalizes this procedure, the Submodular Saturation algorithm (Saturate), as the algorithm considers the truncated objectives $F_{i,c}$ and chooses sets which saturate all these objectives. Theorem 3 (given below) states that Saturate is guaranteed to find a set which achieves adversarial score $\min_i F_i$ at least as high as the optimal solution, if we allow the set to be logarithmically larger than the optimal solution.

Theorem 3. For any integer $k$, Saturate finds a solution $A_S$ such that $\min_i F_i(A_S) \geq \max_{|A| \leq k} \min_i F_i(A)$ and $|A_S| \leq \alpha k$, for $\alpha = 1 + \log\left(\max_{s \in V} \sum_i F_i(\{s\})\right)$. The total number of submodular function evaluations is $O\!\left(|V|^2 m \log\left(\sum_i F_i(V)\right)\right)$.

Note that the algorithm still makes sense for any value of $\alpha$. However, if $\alpha < 1 + \log\left(\max_{s \in V} \sum_i F_i(\{s\})\right)$, the guarantee of Theorem 3 does not hold. If we had an exact algorithm for submodular coverage, $\alpha = 1$ would be the correct choice. Since the greedy algorithm solves submodular coverage very effectively, in our experiments we call Saturate with $\alpha = 1$, which empirically performs very well. The worst-case running time guarantee is quite pessimistic, and in practice the algorithm is much faster: Using a priority queue and lazy evaluations, Algorithm 1 can be sped up drastically (c.f. [12] for details). Furthermore, in practical implementations, one would stop GPC once $\alpha k + 1$ elements have been selected, which already proves that the optimal solution with $k$ elements cannot achieve score $c$. Also, Algorithm 2 can be terminated once $c_{\max} - c_{\min}$ is sufficiently small; in our experiments, 10-15 iterations usually sufficed.
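A compact Python sketch of Algorithms 1 and 2, assuming the objectives are callables on sets; it follows the pseudocode above but is not the authors' implementation, and it omits the lazy-evaluation and early-stopping speedups mentioned above.

```python
def gpc(F_bar, c, V):
    """Algorithm 1 (GPC): greedily grow A until the truncated average reaches c."""
    A = set()
    while F_bar(A) < c and len(A) < len(V):
        # element with the largest marginal gain under the truncated average objective
        s = max(set(V) - A, key=lambda s: F_bar(A | {s}) - F_bar(A))
        A.add(s)
    return A

def saturate(Fs, V, k, alpha=1.0, tol=None):
    """Algorithm 2 (Saturate): binary search on c with one GPC call per level.

    Fs: list of monotonic submodular objectives (set -> float); V: ground set;
    k: target size; alpha: relaxation of the size constraint; tol: binary search
    tolerance (1/m for integral objectives, as in the pseudocode).
    """
    m = len(Fs)
    c_min, c_max = 0.0, min(F(set(V)) for F in Fs)
    A_best = set()
    tol = 1.0 / m if tol is None else tol
    while c_max - c_min >= tol:
        c = (c_min + c_max) / 2.0
        F_bar = lambda A, c=c: sum(min(F(A), c) for F in Fs) / m  # truncated average
        A_hat = gpc(F_bar, c, V)
        if len(A_hat) > alpha * k:
            c_max = c                      # c not achievable within the relaxed budget
        else:
            c_min, A_best = c, A_hat
    return A_best
```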
One might ask whether the guarantee on the size of the set, $\alpha$, can be improved. Unfortunately, this is not likely, as the following theorem shows:

Theorem 4. If there were a polynomial time algorithm which, for any integer $k$, is guaranteed to find a solution $A_S$ such that $\min_i F_i(A_S) \geq \max_{|A| \leq k} \min_i F_i(A)$ and $|A_S| \leq \alpha k$, where $\alpha \leq (1 - \varepsilon)\left(1 + \log \max_{s \in V} \sum_i F_i(\{s\})\right)$ for some fixed $\varepsilon > 0$, then NP $\subseteq$ DTIME($n^{\log \log n}$).

(This bound is only meaningful for integral $F_i$; otherwise it could be arbitrarily improved by scaling the $F_i$.) Here, DTIME($n^{\log \log n}$) is a class of deterministic, slightly superpolynomial (but subexponential) algorithms [4]; the inclusion NP $\subseteq$ DTIME($n^{\log \log n}$) is considered unlikely [4].

Extensions. We now show how the assumptions made in our presentation above can be relaxed.

Non-integral objectives. Most objective functions $F_i$ in the observation selection setting are not integral (e.g., marginal variances of GPs). If they take rational values, we can scale the objectives by multiplying by their common denominator. If we allow a small additive error, we can approximate their values by their leading digits. An analysis similar to the one presented in [2] can be used to bound the effect of this approximation on the theoretical guarantees obtained by the algorithm.

Non-constant thresholds. Consider the example of minimax Kriging designs for GP regression. Here, $F_i(A) = \sigma^2_i - \sigma^2_{i|A}$ denotes the variance reduction at location $i$. However, rather than guaranteeing that $F_i(A) \geq c$ for all $i$ (which, in this example, means that the minimum variance reduction is at least $c$), we want to guarantee that $\sigma^2_{i|A} \leq c$ for all $i$. We can easily adapt our approach to handle this case: Instead of defining $F_{i,c}(A) = \min\{F_i(A), c\}$, we define $F_{i,c}(A) = \min\{F_i(A), \sigma^2_i - c\}$, and then again perform binary search over $c$, but searching for the smallest feasible $c$ instead. The algorithm, using objectives modified in this way, carries the same approximation guarantees.

Non-uniform observation costs. We can extend Saturate to the setting where different observations have different costs, as sketched below. Suppose a cost function $g: V \to \mathbb{R}^+$ assigns each element $s \in V$ a positive cost $g(s)$; the cost of a set of observations is then $g(A) = \sum_{s \in A} g(s)$. The problem is to find $A^* = \mathrm{argmax}_{A \subseteq V} \min_i F_i(A)$ subject to $g(A) \leq B$, where $B > 0$ is a budget we can spend on making observations. In this case, we use the rule $\delta_s \leftarrow \left(\bar{F}_c(A \cup \{s\}) - \bar{F}_c(A)\right)/g(s)$ in Algorithm 1. For this modified algorithm, Theorem 3 still holds, with $|A|$ replaced by $g(A)$ and $k$ replaced by $B$.
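For the non-uniform cost extension, the only change to the sketch of Algorithm 1 above is the cost-benefit selection rule; a hedged variant (with an optional early stop once a relaxed budget is exceeded) could look as follows.

```python
def gpc_budgeted(F_bar, c, V, g, stop_cost=None):
    """Cost-benefit variant of Algorithm 1 for non-uniform observation costs.

    g: dict assigning each element a positive cost.  Elements are added by
    largest marginal gain per unit cost,
    delta_s = (F_bar(A + {s}) - F_bar(A)) / g[s].
    If stop_cost is given (e.g. alpha * B), selection stops once the accumulated
    cost exceeds it, mirroring the early stopping discussed for the unit-cost case.
    """
    A, cost = set(), 0.0
    while F_bar(A) < c and len(A) < len(V):
        s = max(set(V) - A, key=lambda s: (F_bar(A | {s}) - F_bar(A)) / g[s])
        A.add(s)
        cost += g[s]
        if stop_cost is not None and cost > stop_cost:
            break
    return A
```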
4 Experimental Results

Minimax Kriging. We use Saturate to select observations in a GP to minimize the maximum posterior variance. We consider precipitation data from the Pacific Northwest of the United States [13]. We discretize the space into 167 locations. In order to estimate variance reductions, we consider the empirical covariance of 50 years of data, which we preprocessed as described in [2]. In the geostatistics literature, the predominant choice of optimization algorithms are carefully tuned local search procedures, most prominently simulated annealing (c.f. [14, 15]). We compare our Saturate algorithm against a state-of-the-art implementation of such a simulated annealing (SA) algorithm, first proposed by [14]. We use an optimized implementation described recently by [15]. This algorithm has 7 parameters which need to be tuned, describing the annealing schedule, the distribution of iterations among several inner loops, etc. We use the parameter settings as reported by [15], and report the best result of the algorithm among 10 random trials. In order to compare observation sets of the same size, we called Saturate with $\alpha = 1$.

Fig. 1(a) compares simulated annealing, Saturate, and the greedy algorithm which greedily selects elements that decrease the maximum variance the most. We also used Saturate to initialize the simulated annealing algorithm (using only a single run of simulated annealing, as opposed to 10 random trials). Saturate obtains placements which are drastically better than the placements obtained by the greedy algorithm. Furthermore, its performance is very close to that of the simulated annealing algorithm. When selecting 30 and more sensors, Saturate strictly outperforms the simulated annealing algorithm. Furthermore, as Fig. 1(b) shows, Saturate is significantly faster than simulated annealing, by factors of 5-10 for larger problems. When using Saturate to initialize the simulated annealing algorithm, the resulting placements were almost always the best solutions we were able to find, while still executing faster than simulated annealing with 10 random restarts as proposed by [15]. These results indicate that Saturate compares favorably to state-of-the-art local search heuristics, while being faster, requiring no parameters to tune, and providing theoretical approximation guarantees.

Optimizing for the maximum variance could potentially be considered too pessimistic. Hence we compared placements obtained by Saturate, minimizing the maximum marginal posterior variance, with placements obtained by the greedy algorithm, where we minimize the average marginal variance. Note that, whereas the maximum variance reduction is non-submodular, the average variance reduction is submodular in a wide range of GPs [1], and hence the greedy algorithm can be expected to provide near-optimal placements. Fig. 1(c) presents the maximum and average marginal variances for both algorithms. Our results show that if we optimize for the maximum variance we still achieve comparable average variance. If we optimize for the average variance, however, the maximum posterior variance remains much higher. In the longer version of this paper [9], we present results on two more real data sets, which are qualitatively similar to those discussed here.

[Figure 1 contains three panels: (a) maximum marginal variance vs. number of sensors, (b) running time (s) vs. number of observations, (c) average and maximum marginal variance vs. number of sensors.]
Figure 1: (a) Saturate, greedy and SA on the precipitation data. Saturate performs comparably with the fine-tuned SA algorithm, and outperforms it for larger placements. (b) Running times for the same experiment. (c) Optimizing for the maximum variance (using Saturate) leads to low average variance, but optimizing for average variance (using greedy) does not lead to low maximum variance.
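To show how the pieces fit together on a minimax Kriging instance, here is a small end-to-end sketch using the variance_reduction and saturate sketches from earlier sections. The covariance matrix is a toy squared-exponential kernel on a 1-D grid, purely illustrative and unrelated to the precipitation data.

```python
import numpy as np

# Toy covariance: squared-exponential kernel on a 1-D grid plus a small nugget
# for numerical stability (illustrative stand-in for an empirical covariance).
n = 30
pts = np.linspace(0.0, 1.0, n)
Sigma = np.exp(-((pts[:, None] - pts[None, :]) ** 2) / 0.02) + 1e-6 * np.eye(n)

V = list(range(n))
# one objective per location: the variance reduction F_s(A)
objectives = [lambda A, s=s: variance_reduction(Sigma, A, s) for s in range(n)]

# place k = 5 sensors with alpha = 1, as in the experiments
placement = saturate(objectives, V, k=5, alpha=1.0)
print(placement, min(F(placement) for F in objectives))  # worst-case variance reduction
```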
Robust Experimental Design. We consider the robust design of experiments for the Michaelis-Menten mass-action kinetics model, as discussed in [6]. The goal is least-squares parameter estimation for a function $y = f(x, \theta)$, where $x$ is the chosen experimental stimulus (the initial substrate concentration $S_0$), and $\theta = (\theta_1, \theta_2)$ are two parameters as described in [6]. The stimulus $x$ is chosen from a menu of six options, $x \in \{1/8, 1, 2, 4, 8, 16\}$, each of which can be chosen repeatedly. The goal is to produce a fractional design $w = (w_1, \ldots, w_6)$, where each component $w_i$ measures the relative frequency with which the stimulus $x_i$ is chosen. Since $f$ is nonlinear, $f$ is linearized around an initial parameter estimate $\theta_0 = (\theta_{01}, \theta_{02})$, and approximated by its Jacobian $V_{\theta_0}$. Classical experimental design considers the error covariance of the least squares estimate $\hat{\theta}$, $\mathrm{Cov}(\hat{\theta} \mid \theta_0, w) = \sigma^2 (V_{\theta_0}^T W V_{\theta_0})^{-1}$, where $W = \mathrm{diag}(w)$, and aims to find designs $w$ which minimize this error covariance. E-optimality, the criterion adopted by [6], measures smallness in terms of the maximum eigenvalue of the error covariance matrix. The optimal $w$ can be found using semidefinite programming (SDP) [7].

The estimate $\mathrm{Cov}(\hat{\theta} \mid \theta_0, w)$ depends on the initial parameter estimate $\theta_0$, around which the linearization is performed. However, since the goal is parameter estimation, a "certain circularity is involved" [6]. To avoid this problem, [6] find a design $w_\rho(\theta_0)$ by solving a robust SDP which minimizes the error size, subject to a worst-case (adversarially chosen) perturbation $\Delta$ of the Jacobian $V_{\theta_0}$; the robustness parameter $\rho$ bounds the spectral norm of $\Delta$. As evaluation criterion, [6] define a notion of efficiency, which is the error size of the optimal design with correct initial parameter estimate, divided by the error when using a robust design obtained at the wrong initial parameter estimates, i.e., efficiency $= \lambda_{\max}[\mathrm{Cov}(\hat{\theta} \mid \theta_{\mathrm{true}}, w_{\mathrm{opt}}(\theta_{\mathrm{true}}))] \,/\, \lambda_{\max}[\mathrm{Cov}(\hat{\theta} \mid \theta_{\mathrm{true}}, w_\rho(\theta_0))]$, where $w_{\mathrm{opt}}(\theta)$ is the E-optimal design for parameter $\theta$. They show that for appropriately chosen values of $\rho$, the robust design is more efficient than the optimal design if the initial parameter $\theta_0$ does not equal the true parameter.

While their results are very promising, an arguably more natural approach than perturbing the Jacobian would be to perturb the initial parameter estimate, around which the linearization is performed. E.g., if the function $f$ describes a process which behaves characteristically differently in different "phases", and the parameter $\theta$ controls which of the phases the process is in, then a robust design should intuitively "hedge" against the behavior in each possible phase. In such a case, the uniform distribution (which the robust SDP chooses for large $\rho$) would not be the most robust design. If we discretize the space of possible parameter perturbations (within a reasonably chosen interval), we can use Saturate to find robust experimental designs. While the classical E-optimality criterion is not submodular [2], Bayesian A-optimality is (usually) submodular [1, 2]. Here, the goal is to minimize the trace of the error covariance rather than its maximum eigenvalue. Furthermore, we equip the parameters with an uninformative normal prior (which we chose as $\mathrm{diag}([20^2, 20^2])$), and then minimize the expected trace of the posterior error covariance, $\mathrm{tr}(\Sigma_{\theta|A})$. Here, $A$ is a discrete design of 20 experiments, where each option $x_i$ can be chosen repeatedly. In order to apply Saturate, for each $\theta$ we define $F_\theta(A)$ as the normalized variance reduction $F_\theta(A) = \frac{1}{Z_\theta}\left(\sigma^2_\theta - \sigma^2_{\theta|A}\right)$. The normalization $Z_\theta$ is chosen such that $F_\theta(A) = 1$ if $A = \mathrm{argmax}_{|A'| = 20} F_\theta(A')$, i.e., if $A$ is chosen to maximize only $F_\theta$. Saturate is then used to maximize the worst-case normalized variance reduction.
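The normalization step can be sketched as follows; this is a simplified stand-in (it ignores repeated stimuli and approximates the maximization defining $Z_\theta$ by the greedy heuristic from Sec. 2), not the construction used in the experiments.

```python
def make_normalized_objectives(var_reduction_fns, V, budget=20):
    """Build normalized objectives F_theta(A) = (variance reduction of A) / Z_theta.

    var_reduction_fns: one callable (set -> variance reduction) per candidate
    parameter value theta.  Z_theta is approximated here as the reduction achieved
    by greedily optimizing for theta alone with `budget` observations.
    """
    normalized = []
    for vr in var_reduction_fns:
        A_best = greedy_maximize(V, vr, budget)   # greedy sketch from Sec. 2
        Z = vr(A_best) or 1.0                     # guard against a zero normalizer
        normalized.append(lambda A, vr=vr, Z=Z: vr(A) / Z)
    return normalized
```

The resulting list can be passed directly to the saturate sketch to maximize the worst-case normalized variance reduction over the discretized parameter values.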
We reproduced the experiment of [6], where the initial estimate of the second component $\theta_{02}$ of $\theta_0$ was varied between 0 and 16, the "true" value being $\theta_2 = 2$. For each initial estimate of $\theta_{02}$, we computed a robust design, using the SDP approach and using Saturate, and compared them using the efficiency metric of [6]. We first optimized designs which are robust against a small perturbation of the initial parameter estimate. For the SDP, we chose a robustness parameter $\rho = 10^{-3}$, as reported in [6]. For Saturate, we considered an interval $[\theta_0 / (1 + \beta),\ \theta_0 (1 + \beta)]$ around the initial estimate, discretized into a 5 × 5 grid, with $\beta = 0.1$. Fig. 2(a) shows three characteristically different regions, A, B, C, separated by vertical lines. In region B, which contains the true parameter setting, the E-optimal design (which is optimal if the true parameter is known, i.e., if $\theta_{02} = \theta_2$) performs similarly to both robust methods. Hence, in region B (i.e., for small deviations from the true parameter), robustness is not really necessary. Outside of region B, however, where the standard E-optimal design performs badly, neither robust design performs well either. This is an intuitive result, as they were optimized to be robust only against small parameter perturbations.

Consequently, we compared designs which are robust against a large parameter range. For the SDP, we chose $\rho = 16.3$, which is the maximum spectral variation of the Jacobian when we consider all initial estimates $\theta_{02}$ between 0 and 16. For Saturate, we optimized a single design which achieves the maximum normalized variance reduction over all values of $\theta_{02}$ between 0 and 16. Fig. 2(a) shows that, in this case, the design obtained by Saturate achieves an efficiency of 69%, whereas the efficiency of the SDP design is only 52%. In regions A and C, the Saturate design strictly outperforms the other robust designs. This experiment indicates that designs which are robust against a large range of initial parameter estimates, as provided by Saturate, can be more efficient than designs which are robust against perturbations of the Jacobian (the SDP approach).

[Figure 2 contains three panels: (a) efficiency (w.r.t. E-optimality) vs. initial parameter estimate $\theta_{02}$, comparing Saturate (small and large interval), the robust SDP designs ($\rho = 10^{-3}$ and $\rho = 16.3$) and the classical E-optimal design; (b) maximum detection time $Z_1$ (minutes) vs. number of sensors; (c) maximum population affected $Z_2$ vs. number of sensors.]
Figure 2: (a) Efficiency of the robust SDP of [6] and Saturate on a biological experimental design problem. For a large range of initial parameter estimates, Saturate outperforms the SDP solutions. (b,c) Saturate, greedy and SA in the water network setting, when optimizing worst-case detection time ($Z_1$) and affected population ($Z_2$). Saturate performs comparably to SA for $Z_2$ and strictly outperforms SA for $Z_1$.

Outbreak Detection. Consider a city water distribution network, delivering water to households via a system of pipes, pumps, and junctions. Accidental or malicious intrusions can cause contaminants to spread over the network, and we want to select a few locations (pipe junctions) at which to install sensors, in order to detect these contaminations as quickly as possible.
In August 2006, the Battle of Water Sensor Networks (BWSN) [16] was organized as an international challenge to find the best sensor placements for a real (but anonymized) metropolitan water distribution network, consisting of 12,527 nodes. In this challenge, a set of intrusion scenarios is specified, and for each scenario a realistic simulator provided by the EPA [17] is used to simulate the spread of the contaminant for a 48 hour period. An intrusion is considered detected when one selected node shows positive contaminant concentration. BWSN considered a variety of impact measures, including the time to detection (called $Z_1$), and the size of the affected population calculated using a realistic disease model ($Z_2$). The goal of BWSN was to minimize the expectation of the impact measures $Z_1$ and $Z_2$ given a uniform distribution over intrusion scenarios. In this paper, we consider the adversarial setting, where an opponent chooses the contamination scenario with knowledge of the sensor locations. The objective functions $Z_1$ and $Z_2$ are in fact submodular for a fixed intrusion scenario [8], and so the adversarial problem of minimizing the impact of the worst possible intrusion fits into our model. For these experiments, we consider scenarios which affect at least 10% of the network, resulting in a total of 3424 scenarios.

Figures 2(b) and 2(c) compare the greedy algorithm, Saturate, and the simulated annealing (SA) algorithm for the problem of optimizing the worst-case detection time ($Z_1$) and the worst-case affected population ($Z_2$). Interestingly, the behavior is very different for the two objectives. For the affected population ($Z_2$), greedy performs reasonably, and SA sometimes even outperforms Saturate. For the detection time ($Z_1$), however, the greedy algorithm did not improve the objective at all, and SA performs poorly. The reason is that for $Z_2$, the maximum achievable scores, $F_i(V)$, vary drastically, since some scenarios have much higher impact than others. Hence, there is a strong "gradient", as the adversarial objective changes quickly when the high-impact scenarios are covered. This gradient allows greedy and SA to work well. On the contrary, for $Z_1$, the maximum achievable scores $F_i(V)$ are constant, since all scenarios have the same simulation duration. Unless all scenarios are detected, the worst-case detection time stays constant at the simulation length. Hence, many node exchange proposals considered by SA, as well as the addition of a new sensor location by greedy, do not change the adversarial objective, leaving the algorithms without a useful signal to guide the search. Similarly to the GP Kriging setting, our results show that optimizing the worst-case score leads to reasonable performance in the average-case score, but not necessarily vice versa.
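For concreteness, the detection-time criterion can be encoded as a monotone submodular penalty reduction per scenario, in the spirit of [8]; the names t_i and T_max below are illustrative, not from the BWSN specification.

```python
def detection_time_objective(t_i, T_max):
    """Penalty-reduction objective F_i for one intrusion scenario i.

    t_i: dict mapping each candidate sensor location to the time at which a sensor
    there would first detect scenario i (missing locations never detect it).
    T_max: simulation horizon, the penalty incurred if the scenario goes undetected.
    F_i(A) = T_max - min_{s in A} t_i(s) is normalized, monotonic and submodular.
    """
    def F_i(A):
        first_detection = min((t_i.get(s, T_max) for s in A), default=T_max)
        return T_max - first_detection
    return F_i
```

One such objective per scenario, fed into the saturate sketch, corresponds to the adversarial formulation used in this section.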
5 Conclusions

In this paper, we considered the problem of selecting observations which are informative with respect to an objective function chosen by an adversary. We demonstrated how this class of problems encompasses finding designs which minimize the maximum posterior variance in Gaussian Process regression, robust experimental design, and detecting events spreading over graphs. In each of these settings, the individual objectives are submodular and can be approximately optimized using, e.g., the greedy algorithm; the adversarial objective, however, is not submodular. We proved that, unless P = NP, there cannot exist any approximation algorithm for the adversarial problem if the constraint on the observation set size must be met exactly. Consequently, we presented an efficient approximation algorithm, Saturate, which finds observation sets which are guaranteed to be at least as informative as the optimal solution, and only logarithmically more expensive. In a strong sense, this guarantee is the best possible. We extensively evaluated our algorithm on several real-world problems. For Gaussian Process regression, we showed that Saturate compares favorably to state-of-the-art heuristics, while being simpler, faster, and providing theoretical guarantees. For robust experimental design, Saturate performs favorably compared to SDP-based approaches.

Acknowledgements. This work was partially supported by NSF Grants No. CNS-0509383, CNS-0625518, CCF-0448095, CCF-0729022, and a gift from Intel. Anupam Gupta and Carlos Guestrin were partly supported by an Alfred P. Sloan Fellowship, Carlos Guestrin by an IBM Faculty Fellowship, and Andreas Krause by a Microsoft Research Graduate Fellowship.

References
[1] A. Das and D. Kempe. Algorithms for subset selection in linear regression. Under review for FOCS.
[2] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. To appear in JMLR, 2007.
[3] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14:265-294, 1978.
[4] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4), 1998.
[5] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[6] P. Flaherty, M. Jordan, and A. Arkin. Robust design of biological experiments. In NIPS, 2006.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.
[8] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007.
[9] A. Krause, B. McMahan, C. Guestrin, and A. Gupta. Selecting observations against adversarial objectives. Technical Report CMU-ML-07-118, 2007.
[10] T. Fujito. Approximation algorithms for submodular set cover with applications. TIEICE, 2000.
[11] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2:385-393, 1982.
[12] T. G. Robertazzi and S. C. Schwartz. An accelerated sequential algorithm for producing D-optimal designs. SIAM Journal on Scientific and Statistical Computing, 10(2):341-358, March 1989.
[13] M. Widmann and C. S. Bretherton. 50 km resolution daily precipitation for the Pacific Northwest. http://www.jisao.washington.edu/data sets/widmann/, May 1999.
[14] J. Sacks and S. Schiller. Statistical Decision Theory and Related Topics IV, Vol. 2. Springer, 1988.
[15] D. P. Wiens. Robustness in spatial studies II: minimax design. Environmetrics, 16:205-217, 2005.
[16] A. Ostfeld, J. G. Uber, and E. Salomons. Battle of water sensor networks: A design challenge for engineers and algorithms. In 8th Symposium on Water Distribution Systems Analysis, 2006.
[17] L. A. Rossman. The EPANET programmer's toolkit for analysis of water distribution systems. In Annual Water Resources Planning and Management Conference, 1999.