Active Learning for Directed Exploration of Complex Systems

Michael C. Burl, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109 (Michael.C.Burl@jpl.nasa.gov)
Esther Wang, California Institute of Technology, 1200 E. California Blvd, Pasadena, CA 91125 (eswang@caltech.edu)

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Physics-based simulation codes are widely used in science and engineering to model complex systems that would be infeasible to study otherwise. Such codes provide the highest-fidelity representation of system behavior, but are often so slow to run that insight into the system is limited. For example, conducting an exhaustive sweep over a d-dimensional input parameter space with k steps along each dimension requires $k^d$ simulation trials (translating into $k^d$ CPU-days for one of our current simulations). An alternative is directed exploration, in which the next simulation trials are cleverly chosen at each step. Given the results of previous trials, supervised learning techniques (SVM, KDE, GP) are applied to build up simplified predictive models of system behavior. These models are then used within an active learning framework to identify the most valuable trials to run next. Several active learning strategies are examined, including a recently proposed information-theoretic approach. Performance is evaluated on a set of thirteen synthetic oracles, which serve as surrogates for the more expensive simulations and enable the experiments to be replicated by other researchers.

1. Introduction

Many complex processes cannot be studied directly for a variety of reasons, including cost, safety, non-repeatability (e.g., destructive testing), lack of control over relevant variables (e.g., conditions affecting global climate), inability to replicate appropriate conditions in a laboratory environment, and so on. Physics-based numerical simulations offer the only practical means to model and study many of these systems. The internal logic of a simulation code frequently takes the form of coupled partial differential/difference equations that describe the interactions between a large set of elemental nodes or particles. For example, the planetary physics simulations of Durda et al. (2004) use a Smooth Particle Hydrodynamics (SPH) code to model the propagation of energy and fragmentation during the initial collision between two asteroids, followed by an efficient N-body gravitational code to model the subsequent interactions between the resulting fragments.

For our purposes, we are not so interested in the detailed internals of the simulation code; we simply treat the simulator as a black box that takes input parameters and produces raw output. If necessary, a "grading script" can be introduced to evaluate and transform the raw output of the simulator into a binary-valued decision indicating whether the simulation trial produced a desired outcome (+1) or not (-1). Different grading scripts can address various scientific questions within the scope of a single simulator; however, to simplify the terminology, we will take "simulator" to mean "simulator-grading script combination".

A toy problem considered in the experiments is the classic inverted pendulum on a cart studied in control theory (Kwakernaak & Sivan, 1972). The goal is to balance the stick (inverted pendulum) in a vertical position by moving the cart back and forth. The input to this "simulator" consists of four state feedback gain coefficients, and the output is +1 if the resulting controller succeeds in balancing the stick and -1 if it fails. Although in this example a learner could plausibly receive a real-valued signal indicating how close the control system was to succeeding (or failing), we restrict our attention to the situation in which the learner is only informed of a binary-valued outcome, as this assumption better fits our actual problem. Using simulations and directed exploration, we attempt to efficiently learn the set of all controllers that stabilize the system (produce a +1 output).
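To make this toy problem concrete, the following is a minimal sketch of such a simulator-plus-grading-script combination (in Python rather than our actual code). The cart-pole constants, initial condition, horizon, and failure thresholds are illustrative assumptions, not the settings used in our experiments; the dynamics follow the standard cart-pole equations.

```python
import numpy as np

def simulate_pendulum(gains, x0=(0.0, 0.0, 0.1, 0.0), dt=0.01, T=10.0,
                      M=1.0, m=0.1, l=0.5, g=9.81):
    """Simulate a cart-pole under full state feedback u = -gains . state.

    State is (cart position, cart velocity, pole angle, pole angular rate).
    All physical constants and thresholds here are illustrative assumptions.
    """
    s = np.array(x0, dtype=float)
    k = np.asarray(gains, dtype=float)
    for _ in range(int(T / dt)):
        x, xdot, th, thdot = s
        u = -(k @ s)                                # state feedback control
        tmp = (u + m * l * thdot**2 * np.sin(th)) / (M + m)
        thacc = (g * np.sin(th) - np.cos(th) * tmp) / (
            l * (4.0 / 3.0 - m * np.cos(th)**2 / (M + m)))
        xacc = tmp - m * l * thacc * np.cos(th) / (M + m)
        s = s + dt * np.array([xdot, xacc, thdot, thacc])  # Euler step
        if abs(s[2]) > 0.5 or abs(s[0]) > 2.4:      # fell over or ran off track
            return s, False
    return s, True

def q(theta):
    """End-to-end simulator + grading script: four gains in, +1/-1 out."""
    _, balanced = simulate_pendulum(theta)
    return +1 if balanced else -1
```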
1.1. Related Work

Active learning has received considerable attention in the machine learning community, dating back to the work of MacKay (1992) and Cohn et al. (1994). These techniques have been applied with some success to the problem of efficiently learning a predictive model of a continuous-valued output, i.e., in the regression setting. For example, Gramacy et al. (2004) used nonstationary Gaussian process trees to explore a computational fluid dynamics simulation of a NASA reentry vehicle. Pfingsten (2006) used a Bayesian active learning technique to assist in analyzing micro-mechanical sensors. A research area known as Design and Analysis of Computer Experiments (DACE) (Sacks et al., 1989) uses statistical methods, including kriging, to construct surrogates for deterministic computer models. However, outside of our own pilot work (Burl et al., 2006), there has been little effort directed toward using active learning to efficiently explore simulations that provide binary-valued outputs.

The directed exploration problem arises not only for simulators, but also for physical systems that are capable of autonomously performing experiments. Knuth et al. (2007) considered a problem in which a robot arm with a light sensor sequentially makes measurements within its workspace to efficiently determine the size and location of a target region. Nested sampling and Bayesian adaptive estimation were used, but the technique has not been demonstrated outside of a fairly limited hypothesis space consisting of circular disks. Their approach is closely related to the Bayesian experimental design philosophy used in (Veeramachaneni et al., 2006; Olivetti, 2008). Guestrin et al. (2005) consider the problem of optimal sensor placement using Gaussian processes and a mutual information criterion. They conclude that the sensors should be spread evenly throughout the domain, since the posterior variance of their predictions does not depend on the observed sensor values. However, in the classification setting that we consider, the observed "sensor values", i.e., the outcomes of simulator trials, are critical in deciding where to sample next.

2. Approach

The simulator takes as input a vector of parameters, $\theta$, defined over a continuous-valued input domain, $\Theta$, and, with the help of a grading script, outputs a binary-valued indicator of whether the simulation trial produced a desired outcome (+1) or not (-1). We denote the end-to-end function from input to output as q(·). It is well known that some dynamical systems are chaotic; for such systems, small variations in $\theta$ could produce very different values of $q(\theta)$. We do not consider such systems, nor do we consider noisy systems where the value returned for a given input varies from one run to the next. Instead, we assume a deterministic simulator that has some underlying spatial coherence (spatial correlation, spatial structure, continuity) in the q(·) function.
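As a toy illustration of how different grading scripts pose different binary-valued questions to the same simulator, consider the sketch below. The simulator here is a self-contained stand-in stub, and the output fields and thresholds are arbitrary assumptions for illustration only.

```python
import numpy as np

def simulate(theta):
    """Stand-in for an expensive physics code: returns a raw output record.

    A real simulator would run for hours or days; here we fabricate a raw
    output (e.g., a final energy and a fragment count) so the sketch runs.
    """
    theta = np.asarray(theta, dtype=float)
    return {"energy": float(np.sum(theta**2)),
            "fragments": int(10 * abs(theta[0]))}

def grade_low_energy(raw):
    """Grading script #1: did the trial end in a low-energy configuration?"""
    return +1 if raw["energy"] < 0.5 else -1

def grade_few_fragments(raw):
    """Grading script #2: did the trial produce few fragments?"""
    return +1 if raw["fragments"] <= 3 else -1

def make_q(grading_script):
    """Compose simulator and grading script into an end-to-end q(.)."""
    return lambda theta: grading_script(simulate(theta))

q1 = make_q(grade_low_energy)     # one binary-valued question...
q2 = make_q(grade_few_fragments)  # ...and another, from the same simulator
```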
2.1. Approximation from Samples

The Nyquist sampling theorem, a fundamental result from signal processing, states that a band-limited signal can be exactly represented by its samples provided that the sampling rate is high enough. Because of our spatial coherence assumption, an oracle will consist mainly of low spatial frequency regions (except, of course, for step edges at the transition between areas of +1 and -1); hence, there is some hope that the function can be well-approximated through a finite set of samples. Results from computational learning theory, such as the Support Vector Machine (SVM) of Vapnik (1995), also suggest that one can learn any well-behaved function from a set of training data. If all of the grid points and corresponding labels from an exhaustive sweep were supplied to an SVM learning algorithm as training data, a relatively small subset of the data, i.e., the support vectors, would be sufficient to define a function that agrees almost everywhere with the true function q(·).

For the simulation problem, we want to construct an approximation, $\hat{q}(\cdot)$, to q(·) such that $\hat{q}(\cdot)$ agrees with q(·) over most of the input domain. The function $\hat{q}(\cdot)$ will be defined from training examples: input-output pairs that are in essence samples, in the signal processing sense, from the true function q(·). (There is no constraint, however, that the samples be regularly spaced.) Although one could select these samples through (i) a grid strategy, in which trial points are spread evenly across the input domain, or (ii) a random strategy, in which trial points are chosen at random within the domain, we believe that a more informative set of points can be actively selected by taking into account the current state of knowledge.

Thus, our basic approach consists of predictive modeling plus active learning. The physics-based simulation is used as an oracle to sequentially generate labeled training data in the form of input-output pairs. At each step or round, the algorithm uses a supervised learning algorithm to form a simplified predictive model based on the currently available training data. An active learning algorithm then chooses an unlabeled point that would be particularly valuable if it were labeled (by a run through the simulator). Because the simulations are so computationally demanding, it is critical to get the most out of each simulation trial.
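In outline, the resulting procedure is the loop sketched below. The names fit_model and choose_next_point are placeholders for the learners of Section 2.2 and the strategies of Section 2.3, and the candidate subsampling mirrors the 100-point subsets described in Section 3.2; this is a sketch of the approach, not our exact experimental code.

```python
import numpy as np

def directed_exploration(q, pool, fit_model, choose_next_point,
                         rounds=400, seed=0):
    """Generic predictive-modeling-plus-active-learning loop.

    q                 : black-box oracle mapping a parameter vector to +1/-1
    pool              : (N, d) array of candidate (unlabeled) input points
    fit_model         : callable (X, y) -> trained model (q-hat)
    choose_next_point : callable (model, candidates) -> index of next trial
    """
    rng = np.random.default_rng(seed)
    unlabeled = list(range(len(pool)))
    # Seed with one random trial so the first model can be fit.
    first = unlabeled.pop(rng.integers(len(unlabeled)))
    X, y = [pool[first]], [q(pool[first])]
    models = []
    for r in range(rounds):
        model = fit_model(np.array(X), np.array(y))   # q-hat^(r)
        models.append(model)
        # Subsample candidates (100 per round in our experiments) for speed.
        cand = rng.choice(unlabeled, size=min(100, len(unlabeled)),
                          replace=False)
        pick = cand[choose_next_point(model, pool[cand])]
        unlabeled.remove(pick)
        X.append(pool[pick])
        y.append(q(pool[pick]))                       # run the simulator
    return models, np.array(X), np.array(y)
```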
2.2. Predictive Models

The predictive model, $\hat{q}(\cdot)$, is supposed to approximate the input-output behavior of the simulator. As more training data is acquired through ongoing runs of the simulator, our estimate of $\hat{q}$ will change (and hopefully improve); a superscript (r) is added to $\hat{q}$ to designate the estimated surrogate function after the r-th example from the simulator has been observed. Also, since we recover $\hat{q}(\cdot)$ through a learning procedure that typically depends on additional hyperparameters $\lambda$ (e.g., kernel type, kernel bandwidth), we use $\hat{q}^{(r)}_{\lambda}(\cdot)$ to designate the estimated function after the r-th trial using hyperparameters $\lambda$. Although the surrogate function will ultimately produce binary-valued labels (+1 or -1), we prefer to think of $\hat{q}(\cdot)$ as producing a continuous-valued output that can be thresholded at different values to produce a receiver operating characteristic (ROC) curve.

We consider three different types of predictive models for the surrogate function: support vector machines (SVM), kernel density estimation (KDE), and Gaussian processes (GP).

2.2.1. Support Vector Machines (SVM)

Since their introduction by Vapnik, SVMs have been extensively studied. Learning an SVM from training data involves solving a quadratic programming problem. We use the libsvm implementation (Chang & Lin, 2001) in our work. Hyperparameters for the SVM learning problem include the type of kernel (RBF, polynomial, etc.), any kernel-specific parameters (e.g., the RBF bandwidth), and the regularization constant C. The learned model consists of some number $n_s$ of support vectors $s_i$, weights $\alpha_i$, and a scalar offset $b$. The decision statistic is defined as:

$$\hat{q}(\theta) = -b + \sum_{i=1}^{n_s} \alpha_i \, K(\theta, s_i) \qquad (1)$$

where $\theta$ is a point in input space and $K(\cdot,\cdot)$ is the kernel function. Note that we have omitted the usual sgn(·) function around the right-hand side to preserve the continuous-valued decision statistic. We could use a logistic function as in (Platt, 1999) to transform this into a pseudo-probability.

2.2.2. Kernel Density Estimation (KDE)

With kernel density estimation (KDE), also known as the Parzen window technique, we attempt to directly estimate the posterior probability that a point belongs to the positive or negative class by modeling the class-conditional probability densities. KDE estimates the posterior probability for the label t of a point z as follows:

$$p(t = +1 \mid z, X, y) = \frac{S_+(z)}{S_+(z) + S_-(z)} \qquad (2)$$

where $S_+(z) = \alpha_+ + \sum_{x_i \in X_+} K(z, x_i)$ and $S_-(z) = \alpha_- + \sum_{x_j \in X_-} K(z, x_j)$. Here $X_+$ is the set of instances in X that have +1 labels and $X_-$ is the set of instances in X that have -1 labels. The scalar values $\alpha_+$ and $\alpha_-$ are used in Bayesian fashion to define a prior probability that must then be supported (or overturned) by the actual data. Active learning requires frequent updates of the classifier given a newly selected training example; it is clear from Equation 2 that KDE allows a simple recursive update.

2.2.3. Gaussian Processes (GP)

Let r be a (hidden) random variable that represents a continuous-valued version of the label t of z. Under a GP model, the posterior probability density over r given z and the training set is:

$$p(r \mid z, X, y) = \mathcal{N}\left(r;\; \mu_{r|z,X,y},\; \sigma^2_{r|z,X,y}\right) \qquad (3)$$

where $\mathcal{N}$ represents the normal distribution. The conditional mean and variance are as follows:

$$\mu_{r|z,X,y} = K(z, X)\, K(X, X)^{-1}\, y \qquad (4)$$

$$\sigma^2_{r|z,X,y} = K(z, z) - K(z, X)\, K(X, X)^{-1}\, K(X, z) \qquad (5)$$

Integrating the posterior probability distribution over r from 0 to $\infty$ provides a probability for the event {t = +1}:

$$p(t = +1 \mid z, X, y) = \int_0^{\infty} p(r \mid z, X, y)\, dr \qquad (6)$$

$$= 1 - \Phi\left(0;\; \mu_{r|z,X,y},\; \sigma^2_{r|z,X,y}\right) \qquad (7)$$

where $\Phi$ is the cumulative distribution function of the Gaussian distribution. As with KDE, the GP classifier allows for a relatively straightforward recursive update (in particular, the inverse covariance can be efficiently updated using the block matrix inversion lemma). The overall cost is roughly $O(PM + P^2)$, where P is the number of training examples and M is the number of different values of z.
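Equations 2-7 are short enough to state directly in code. The sketch below is a minimal, non-recursive implementation under an RBF kernel, with hyperparameter defaults echoing Section 3.2; the noise variance added to K(X, X) follows our experimental setup rather than Equation 5, and kde_update illustrates the recursive update noted above.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, gamma=50.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = (np.sum(A**2, 1)[:, None] - 2.0 * A @ B.T + np.sum(B**2, 1)[None, :])
    return np.exp(-gamma * d2)

def kde_posterior(X, y, Z, alpha_pos=0.1, alpha_neg=0.1, gamma=50.0):
    """Equation 2: p(t = +1 | z) for each row z of Z."""
    K = rbf(Z, X, gamma)
    S_pos = alpha_pos + K[:, y == +1].sum(axis=1)
    S_neg = alpha_neg + K[:, y == -1].sum(axis=1)
    return S_pos / (S_pos + S_neg)

def kde_update(S_pos, S_neg, Z, x_new, t_new, gamma=50.0):
    """Recursive form of Equation 2: fold one new labeled point into S+/S-."""
    k = rbf(Z, x_new[None, :], gamma)[:, 0]
    return (S_pos + k, S_neg) if t_new == +1 else (S_pos, S_neg + k)

def gp_posterior(X, y, Z, gamma=50.0, sigma_n=0.1):
    """Equations 3-7: p(t = +1 | z) from the GP predictive mean and variance.

    The sigma_n**2 term is the noise variance used in our experiments
    (Section 3.2); it also keeps the linear solves well-conditioned.
    """
    Kxx = rbf(X, X, gamma) + sigma_n**2 * np.eye(len(X))
    Kzx = rbf(Z, X, gamma)
    A = np.linalg.solve(Kxx, Kzx.T)           # K(X,X)^{-1} K(X,Z)
    mu = Kzx @ np.linalg.solve(Kxx, y)        # Equation 4
    var = 1.0 - np.sum(Kzx * A.T, axis=1)     # Equation 5; K(z,z) = 1 for RBF
    scale = np.sqrt(np.maximum(var, 1e-12))
    return 1.0 - norm.cdf(0.0, loc=mu, scale=scale)   # Equation 7
```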
2.3. Active Learning Strategies

2.3.1. Passive Learning (PAS)

In passive learning, the next point is randomly chosen from the remaining set of unlabeled points. In contrast, in active learning we choose the next point according to some value criterion.

2.3.2. Most Confused Point (MCP)

A common active learning strategy (Tong & Koller, 2001) chooses the unlabeled point whose label is most uncertain given the current state of knowledge. Let $\hat{q}^{(r)}(\theta)$ represent the probability that $\theta$ should be assigned to class +1 given information up to round r. Then an MCP strategy would choose:

$$\theta_{MCP} = \arg\min_{\theta} \left| \hat{q}^{(r)}(\theta) - 0.5 \right| \qquad (8)$$

(For SVM, this expression assumes the raw hyperplane distance has been converted into a pseudo-probability.) Alternatively, we could convert the probabilities into a single-point entropy,

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p) \qquad (9)$$

and then select the point with maximum entropy.

2.3.3. Most Informative Point (MIP)

Even if the label of a particular point is highly uncertain, knowledge of its true label may have little leverage over other unlabeled points. Consider a point A that falls precisely between a well-understood region of positive examples and a similar region of negative examples. Knowing the label of A may not be very valuable because A does not have much influence over the other unlabeled points. Now consider a point B that is in an unexplored cluster of points far away from any currently labeled points. Knowing the label of B may be extremely valuable because the other points in the cluster would presumably belong to the same class; thus, knowing this label simultaneously removes uncertainty about the labels of many other points.

The basic MIP idea, as developed in (Holub et al., 2008), is as follows. Suppose we have a set of training instances X with known labels $y \in \{-1, +1\}$. From this training set, we can learn a classifier that makes a probabilistic prediction about the label t of a test point z. The uncertainty about t given the training instances and training labels can be measured with the binary entropy function H(·) from Equation 9:

$$H(t \mid z, X, y) = H(p) \qquad (10)$$

where p is the classifier's estimate of the probability that the label of z is +1.

Now, suppose that we have the ability to "look ahead" and add another point u to the set of training instances (Lindenbaum et al., 2004). We do not yet know the label of this new point, only the probability that u will have label v = +1. We can compute the uncertainty about the label t of z given the original training instances (X), the training labels (y), and the new training instance (u) along with the hypothesized label v = +1; the uncertainty under the hypothesized label v = -1 can be computed in the same way. The expected uncertainty is then:

$$H(t \mid z, X, y, u) = E_v\{H(t \mid z, X, y, u, v)\} \qquad (11)$$

$$= H(t \mid z, X, y, u, v{=}{+}1) \cdot p(v{=}{+}1 \mid u, X, y) + H(t \mid z, X, y, u, v{=}{-}1) \cdot p(v{=}{-}1 \mid u, X, y) \qquad (12)$$

The expected information gain, I, about the label t is the expected amount of uncertainty that would be removed by incorporating the new instance u into the training set:

$$I(t; u) = H(t \mid z, X, y) - H(t \mid z, X, y, u) \qquad (13)$$

This represents the expected amount of information (in bits) that we will gain about t by incorporating u into the training set. With multiple unlabeled instances (different points z), we can sum the expected information gain to determine the overall value of incorporating u into the training set.

Since both KDE and GP directly provide probability estimates for points, we consider using these methods with MIP. Two approaches could be tried to make SVM compatible with MIP. The logistic regression approach of Platt could be used to convert the SVM hyperplane distance into a pseudo-probability. Alternatively, the committee-of-classifiers approach introduced in (Holub et al., 2008) could be used to construct probabilities from a hard classifier; however, this approach requires two levels of lookahead, making it computationally impractical. We have yet to evaluate either of these approaches for SVM.
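A direct implementation of Equations 10-13 is sketched below. Here, posterior is any probabilistic learner with the calling convention of the KDE or GP sketches above; the brute-force double loop makes explicit the quadratic cost in the pool size discussed in Section 3.3, and for simplicity the candidate u is left in the pool sum.

```python
import numpy as np

def binary_entropy(p):
    """Equation 9, clipped to handle p = 0 or 1 safely."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def mip_select(X, y, pool, posterior):
    """Pick the unlabeled point with maximum summed expected information gain.

    posterior(X, y, Z) -> p(t = +1 | z) for each row z of Z.
    Cost is quadratic in len(pool): one lookahead refit per candidate u
    and per hypothesized label v.
    """
    p_now = posterior(X, y, pool)
    H_now = binary_entropy(p_now).sum()          # current total uncertainty
    gains = np.empty(len(pool))
    for i, u in enumerate(pool):
        Xu = np.vstack([X, u])
        H_exp = 0.0
        for v, pv in ((+1, p_now[i]), (-1, 1.0 - p_now[i])):
            yv = np.append(y, v)                 # hypothesized label (Eqs. 11-12)
            H_exp += pv * binary_entropy(posterior(Xu, yv, pool)).sum()
        gains[i] = H_now - H_exp                 # Equation 13, summed over pool
    return int(np.argmax(gains))
```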
2.3.4. Randomization Meta-Strategy (META)

We also considered a meta-strategy in which the base active learning algorithm does not make a definite choice about which point to deliver to the oracle; instead, the base algorithm simply provides a ranking of the points or a weighted preference over the points. The preference is converted to a probability distribution over the unlabeled points, and the actual point delivered to the oracle is then picked randomly according to this probability distribution. The META strategy was only evaluated with SVM. Baram et al. (2004) discuss other meta-learning strategies.

3. Experimental Results

3.1. Oracles

There is a significant cost in time and computer resources in acquiring simulation data. In many cases, the resulting datasets are proprietary, making it difficult to use such data for algorithm development or head-to-head comparisons. Hence, our experiments were conducted using a set of surrogate oracles. (Matlab code for reproducing these oracles is available by request from the author.) All oracles are defined over the domain $\Theta = [-1, +1]^d$, where d is the dimensionality (number of input parameters). Figure 1 (at the end of the paper) illustrates the datasets.

3.2. Evaluation Methodology

Nine methods were evaluated on the thirteen surrogate oracles shown in Figure 1: SVM-PAS, SVM-MCP, SVM-META, KDE-PAS, KDE-MCP, KDE-MIP, GP-PAS, GP-MCP, GP-MIP. Each method was run seven times on each of the oracles for 400 rounds of active learning. To reduce the run time of the experiments, the active learning algorithms considered only a subset of 100 points from the unlabeled pool at each round. All three of the supervised learning algorithms used the same RBF kernel, $K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$, with parameter $\gamma = 50$. For SVM, the regularization parameter was set to C = 10. For KDE, the values of $\alpha_+$ and $\alpha_-$ were set to 0.1. For GP, the noise standard deviation $\sigma_n$ was set to 0.1.

For each of the methods, a classifier was trained using the sequence of points selected by active learning during round 1 through round r. The resulting classifier, $\hat{q}^{(r)}$, was applied over a fine discretization of the full domain of the oracle. At each discretized point, a real-valued number is produced by the classifier. For KDE and GP, this number is an estimate of the posterior probability that the point is in the positive class, while for SVM, the number is the signed hyperplane distance as in Equation 1. ROC curves are calculated for each method and oracle as a function of the number of rounds, r, of active learning. Figure 2 shows ROC curves for GP-MCP on Oracle 7. Note that each curve has been resampled using a fixed set of points on the $P_{fa}$ axis from 0 to 1, spaced in steps of 0.0005.

[Figure 2. ROC curves (Pd versus Pfa) on Oracle 7 for GP-MCP, color-coded by number of rounds (r = 50, 100, 200, 400). For each value of r, seven trials were conducted to assess variability.]
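A sketch of this ROC bookkeeping, including the Pd-at-fixed-Pfa summary used in the learning curves below, follows; the function name and interpolation details are our own choices rather than our exact evaluation code.

```python
import numpy as np

def pd_at_pfa(scores, labels, pfa_target=0.05):
    """Probability of detection at a fixed false alarm rate.

    scores : real-valued classifier outputs on the discretized domain
    labels : true oracle labels in {-1, +1} at the same points
    """
    order = np.argsort(-scores)                 # sweep threshold high -> low
    pos = labels[order] == +1
    pd = np.cumsum(pos) / max(int(pos.sum()), 1)
    pfa = np.cumsum(~pos) / max(int((~pos).sum()), 1)
    # Resample the (Pfa, Pd) curve onto the fixed 0.0005-step grid (Figure 2),
    # then read off Pd at the target false alarm rate.
    grid = np.arange(0.0, 1.0 + 1e-9, 0.0005)
    pd_grid = np.interp(grid, pfa, pd)
    return float(np.interp(pfa_target, grid, pd_grid))
```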
Traditionally, active learning algorithms are evaluated with learning curves that show the classifier accuracy as a function of round. However, our oracles, and the simulator problem in general, have a significant imbalance between the positive and negative classes, with the negatives greatly outweighing the positives. A classifier that simply says -1 regardless of the input point will already achieve a high degree of agreement with q(·). Instead of using classification accuracy, we therefore use the ROC curves to identify the probability of detection, Pd, at a fixed false alarm rate and plot Pd versus rounds. Figure 3 shows the modified learning curves for each of the nine methods on Oracle 9.

[Figure 3. Learning curves for each of the nine methods on Oracle 9. Each curve shows the probability of detection (Pd) at a fixed false alarm probability (Pfa = 0.05) as a function of the number of rounds of active learning. Except for SVM-MCP, the active methods clearly outperform passive. KDE-MCP reaches 90% Pd at round 100, while KDE-PAS does not reach this level until after round 300, representing a speedup by a factor of three.]

3.3. Discussion

For each of the supervised learning techniques, the results clearly show that active learning outperforms passive learning. For example, row 1 of Figure 4 shows that KDE using MIP (information gain) for active learning is almost always better than KDE using passive learning on all thirteen oracles at rounds 50 and 200.

[Figure 4. Head-to-head comparison between two methods on all oracles. Each box corresponds to one oracle; box size indicates standard error. Points above the red line favor the y-axis method; points below the red line favor the x-axis method. (a) KDE-MIP vs. KDE-PAS (r = 50). (b) KDE-MIP vs. KDE-PAS (r = 200). (c) GP-MCP vs. SVM-META (r = 50). (d) GP-MCP vs. SVM-META (r = 200). KDE-MIP dominates KDE-PAS. GP-MCP and SVM-META perform similarly.]

Looking at additional pairs of box plots (not shown) or at Tables 1-3, we can also conclude that the performance of MCP active learning is approximately the same across the three different supervised learning techniques. Somewhat unexpectedly, MCP appears to be slightly better than the more sophisticated MIP. We hypothesize that MIP may be more sensitive to the accuracy of the estimated probabilities. Another limitation of MIP is that it scales quadratically with the size of the unlabeled pool. For computational reasons, it was necessary to consider a relatively small random subset of the full unlabeled pool at each round, which may have also degraded performance. A possibility for improving the scalability is to use MCP as a prefilter to prioritize the points that should be evaluated with the more expensive information gain criterion (a sketch of this hybrid appears at the end of this section).

SVM-META was typically better than SVM-MCP. In Figure 3, SVM-MCP performs worse than passive learning. We believe that SVM-MCP finds one of the "islands" in Oracle 9 and focuses on refining its boundary rather than adequately exploring the rest of the input domain, i.e., it falls into a sampling selection bias (Zadrozny, 2004) trap.
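The MCP-prefilter hybrid mentioned above might look like the following sketch; we have not evaluated this variant. Here, posterior and mip_select follow the conventions of the sketches in Section 2.3, and the cutoff k is an arbitrary assumption.

```python
import numpy as np

def mcp_prefiltered_mip(X, y, pool, posterior, mip_select, k=10):
    """Hybrid selection: cheap MCP prefilter, expensive MIP on the survivors.

    posterior(X, y, Z) -> p(t = +1 | z); mip_select as in the MIP sketch.
    k controls how many MCP-ranked candidates receive quadratic MIP scoring.
    """
    p = posterior(X, y, pool)
    confusion = -np.abs(p - 0.5)                 # Equation 8: most confused first
    shortlist = np.argsort(confusion)[::-1][:k]  # top-k most uncertain points
    best_local = mip_select(X, y, pool[shortlist], posterior)
    return int(shortlist[best_local])
```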
4. Conclusion

Directed exploration can be used to efficiently understand the behavior of complex systems modeled by numerical simulations. The simulator is used as an oracle to sequentially generate training data (input-output pairs); a simplified predictive model of the system is learned from the currently available training data, and this model is used in an active learning framework to choose the best simulation trials to run next. Performance evaluation on a challenging set of oracles showed a significant improvement of active learning over passive learning, in some cases showing a speedup by a factor of three in the number of rounds required to reach 90% probability of detection at a false alarm rate of 0.05.

Acknowledgment

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. © 2009 California Institute of Technology. Government sponsorship acknowledged. We thank collaborators B. Enke, W. J. Merline, P. Perona, and A. Holub, as well as the anonymous reviewers.

References

Baram, Y., El-Yaniv, R., & Luz, K. (2004). Online choice of active learning algorithms. J. Machine Learning Research, 5, 255-291.

Burl, M. C., DeCoste, D., Enke, B. L., Mazzoni, D., Merline, W. J., & Scharenbroich, L. (2006). Automated knowledge discovery from simulators. SIAM Int. Conf. on Data Mining (pp. 82-93).

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. http://www.csie.ntu.edu.tw/cjlin/libsvm.

Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15, 201-221.

Durda, D. D., Bottke, W. F., Enke, B. L., Merline, W. J., Asphaug, E., Richardson, D. C., & Leinhardt, Z. M. (2004). The formation of asteroid satellites in large impacts: results from numerical simulations. Icarus, 170, 243-257.

Gramacy, R. B., Lee, H. K., & Macready, W. G. (2004). Parameter space exploration with Gaussian process trees. Int. Conf. on Machine Learning (pp. 353-360).

Guestrin, C., Krause, A., & Singh, A. P. (2005). Near-optimal sensor placements in Gaussian processes. Int. Conf. on Machine Learning (pp. 265-272).

Holub, A., Perona, P., & Burl, M. C. (2008). Entropy-based active learning for object recognition. Proc. Online Learning for Classification (pp. 1-8).

Knuth, K., Erner, P., & Frasso, S. (2007). Designing intelligent instruments. MaxEnt, Amer. Inst. of Physics, Conf. Proc. 954 (pp. 203-211).

Kwakernaak, H., & Sivan, R. (1972). Linear optimal control systems. Wiley-Interscience.

Lindenbaum, M., Markovitch, S., & Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers. Machine Learning, 54, 125-152.

MacKay, D. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 590-604.

Olivetti, E. (2008). Sampling strategies for expensive data. Doctoral dissertation, U. of Trento.

Pfingsten, T. (2006). Bayesian active learning for sensitivity analysis. Lecture Notes in Computer Science, 4212, 353-364.

Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61-74). MIT Press.

Sacks, J., Welch, W., Mitchell, T., & Wynn, H. (1989). Design and analysis of computer experiments. Statistical Science, 4, 409-423.

Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. J. Machine Learning Research, 2, 45-66.
Vapnik, V. (1995). The nature of statistical learning theory. Springer, New York.

Veeramachaneni, S., Olivetti, E., & Avesani, P. (2006). Active sampling for detecting irrelevant features. Int. Conf. on Machine Learning (pp. 961-968).

Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. Int. Conf. on Machine Learning (p. 114).

Table 1. Probability of detection, Pd, for the SVM methods on each oracle at Pfa = 0.05, round 50.

ID    SVM-PAS           SVM-MCP           SVM-META
 1    0.918 ± 0.028     0.975 ± 0.005     0.988 ± 0.004
 2    0.867 ± 0.037     0.979 ± 0.004     0.951 ± 0.019
 3    0.774 ± 0.038     0.886 ± 0.043     0.908 ± 0.032
 4    0.713 ± 0.044     0.636 ± 0.033     0.782 ± 0.033
 5    0.653 ± 0.054     0.811 ± 0.035     0.872 ± 0.029
 6    0.794 ± 0.044     0.843 ± 0.046     0.851 ± 0.040
 7    0.612 ± 0.049     0.775 ± 0.019     0.748 ± 0.045
 8    0.647 ± 0.047     0.810 ± 0.022     0.823 ± 0.021
 9    0.591 ± 0.037     0.561 ± 0.101     0.602 ± 0.072
10    0.603 ± 0.110     0.866 ± 0.015     0.713 ± 0.031
11    0.287 ± 0.046     0.316 ± 0.064     0.354 ± 0.059
12    0.349 ± 0.070     0.268 ± 0.055     0.315 ± 0.054
13    0.238 ± 0.096     0.336 ± 0.144     0.392 ± 0.167

Table 2. Probability of detection, Pd, for the KDE methods on each oracle at Pfa = 0.05, round 50.

ID    KDE-PAS           KDE-MCP           KDE-MIP
 1    0.880 ± 0.030     0.988 ± 0.006     0.962 ± 0.021
 2    0.865 ± 0.045     0.972 ± 0.009     0.981 ± 0.005
 3    0.772 ± 0.043     0.947 ± 0.018     0.929 ± 0.019
 4    0.713 ± 0.035     0.813 ± 0.048     0.822 ± 0.015
 5    0.642 ± 0.047     0.795 ± 0.040     0.832 ± 0.033
 6    0.763 ± 0.051     0.890 ± 0.015     0.885 ± 0.017
 7    0.588 ± 0.052     0.731 ± 0.023     0.634 ± 0.028
 8    0.674 ± 0.055     0.793 ± 0.023     0.815 ± 0.015
 9    0.622 ± 0.038     0.725 ± 0.035     0.615 ± 0.035
10    0.603 ± 0.079     0.781 ± 0.050     0.881 ± 0.029
11    0.264 ± 0.039     0.235 ± 0.057     0.204 ± 0.026
12    0.332 ± 0.071     0.513 ± 0.036     0.167 ± 0.028
13    0.246 ± 0.079     0.140 ± 0.054     0.235 ± 0.114
Table 3. Probability of detection, Pd, for the GP methods on each oracle at Pfa = 0.05, round 50.

ID    GP-PAS            GP-MCP            GP-MIP
 1    0.873 ± 0.027     0.955 ± 0.009     0.929 ± 0.008
 2    0.857 ± 0.046     0.951 ± 0.012     0.924 ± 0.012
 3    0.768 ± 0.041     0.890 ± 0.022     0.762 ± 0.029
 4    0.730 ± 0.033     0.866 ± 0.016     0.709 ± 0.014
 5    0.666 ± 0.044     0.810 ± 0.033     0.841 ± 0.012
 6    0.744 ± 0.051     0.878 ± 0.011     0.620 ± 0.037
 7    0.609 ± 0.050     0.720 ± 0.032     0.531 ± 0.030
 8    0.682 ± 0.051     0.820 ± 0.025     0.834 ± 0.042
 9    0.631 ± 0.040     0.758 ± 0.030     0.591 ± 0.020
10    0.603 ± 0.079     0.685 ± 0.055     0.798 ± 0.026
11    0.265 ± 0.040     0.250 ± 0.025     0.385 ± 0.014
12    0.334 ± 0.071     0.474 ± 0.024     0.339 ± 0.021
13    0.243 ± 0.078     0.029 ± 0.012     0.217 ± 0.082

[Figure 1. Thirteen oracles were used to evaluate the algorithms. (1)-(9) LevelSet2D (d = 2, P+ = 0.125); (10)-(12) LevelSet3D (d = 3, P+ = 0.062); (13) Inverted Pendulum (d = 4, P+ = 0.007).]