Compositional Noisy-Logical Learning

Alan Yuille, Department of Statistics, University of California, Los Angeles, CA 90095 USA, yuille@stat.ucla.edu
Songfeng Zheng, Department of Mathematics, Missouri State University, Springfield, MO 65897 USA, SongfengZheng@MissouriState.edu

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We describe a new method for learning the conditional probability distribution of a binary-valued variable from labelled training examples. Our proposed Compositional Noisy-Logical Learning (CNLL) approach learns a noisy-logical distribution in a compositional manner. CNLL is an alternative to the well-known AdaBoost algorithm and performs coordinate descent on a different error measure. We describe two CNLL algorithms and test their performance against AdaBoost on two types of problem: (i) noisy-logical data (such as noisy exclusive-or), and (ii) four standard datasets from the UCI repository. Our results show that we outperform AdaBoost while using significantly fewer weak classifiers, thereby giving a more transparent classifier suitable for knowledge extraction.

1. Introduction

AdaBoost (Freund & Schapire, 1996) is one of the most influential machine learning algorithms for binary classification. It estimates a strong classifier from training data by combining a weighted sum of weak classifiers and can be formulated as coordinate descent on an upper bound of the training error.
But despite its successes in terms of classification performance, AdaBoost is less effective at detecting underlying structure, or extracting knowledge, from the data. Knowledge extraction is important both for transfer learning and for debugging and understanding machine learning systems (Alpaydin, 2004).

There is a deep connection between AdaBoost and additive logistic regression (Friedman et al., 2000), where the conditional distribution of the output is expressed in exponential form in terms of statistics of the input. The statistics and their parameters in logistic regression correspond to the weak classifiers and their weights in AdaBoost. Friedman et al. (2000) prove theorems relating these approaches in the asymptotic limit. These results suggest that the effectiveness of AdaBoost for a specific application will depend on how well the conditional distribution of the application data can be approximated by an exponential distribution. If we can approximate the distribution sparsely (i.e., using an exponential distribution with a small number of statistics), then we expect AdaBoost to give high classification performance and to perform knowledge extraction by specifying which statistics are important. But suppose the distribution can only be approximated by an exponential model which uses a large number of statistics. In this case, AdaBoost may be successful for classification but will produce a strong classifier that is a combination of many weak classifiers and will fail to extract useful knowledge about the data.

We can make an analogy to harmonic analysis, where the goal is to express functions as linear combinations of basis functions, e.g. (Meyer, 2001). There are many choices of bases (Fourier series, Haar bases, wavelets, polynomials) which are complete in the sense that any function can be represented in terms of them by weighted linear combination. But for any specific application it is usually desirable to use bases which represent the functions sparsely, so that a good approximation to the function is obtained using a small number of basis functions. This gives insight into the application and can be thought of as knowledge extraction. Mathematicians have shown that certain bases are best for representing functions in specific functional classes (Meyer, 2001), in the sense that fewer basis functions are required to get good approximations.

These considerations suggest that we design machine learning algorithms based on different representations of conditional distributions. Recently, noisy-logical distributions were proposed (Yuille & Lu, 2008) as an alternative to exponential models (that paper did not contain a learning algorithm). These distributions are complete, in the sense that any conditional distribution can be expressed in noisy-logical form, and are built by composing elementary components, which yields a structured representation of the distribution. Certain distributions, such as noisy-or and noisy-and-not (Pearl, 1988), can be sparsely represented in noisy-logical form but are hard to represent sparsely by exponential distributions (and conversely). In general, noisy-logical distributions are sparse if there is a (noisy) logical process generating the data.

This paper presents two Compositional Noisy-Logical Learning (CNLL) algorithms which perform classification by learning noisy-logical distributions. The hope is that CNLL algorithms will have better performance and yield sparser representations than AdaBoost for certain forms of classification problems. We emphasize that sparse representations can enable knowledge extraction, which can yield many benefits including transfer learning. Sparse representations can also have practical benefits because they reduce the computation time (after learning) and could be useful for applications where computational power is limited.

Our experimental results using CNLL are very promising. Firstly, we consider applications where the data is generated by a noisy-logical process, for example a noisy version of the exclusive-or problem. In these cases, CNLL is able to learn the underlying (noisy) logical process that generates the data and hence performs knowledge extraction. Secondly, we apply CNLL to four standard datasets from the UCI repository which are used to test AdaBoost and its variants. On these datasets we obtain significantly better performance than AdaBoost and also obtain sparse noisy-logical representations of the data.

2. Background: Noisy-Logical and Exponential Distributions

The noisy-logical distribution (Yuille & Lu, 2008) was proposed as a generalization of the noisy-or and noisy-and-not distributions (Pearl, 1988). These types of distributions are of interest to cognitive scientists since human performance on causal learning tasks can be successfully modeled by assuming that humans use noisy-or and noisy-and-not distributions (Cheng, 1997; Griffiths & Tenenbaum, 2005) and more complex noisy-logical distributions (Yuille & Lu, 2008).

A noisy-logical distribution represents the distribution P(y|C) of a binary variable y ∈ {0, 1} conditioned on a binary vector C ∈ {0, 1}^N in terms of a composition of elementary distributions, see Fig. 1 (left panel). Each elementary distribution is of the form P(H_i | ψ_i(C), θ_i), where H_i ∈ {0, 1} is a hidden variable, ψ_i(C) ∈ {0, 1} is a causal feature of C, and θ_i are the parameters of the distribution. The full distribution is obtained by setting the output y to be a (deterministic) logical combination of the hidden states {H_i}. This is formally expressed as:

P(y = 1|C; θ) = Σ_H δ_{y, f(H_0, ···, H_{2^N − 1})} ∏_{i=0}^{2^N − 1} P(H_i | ψ_i(C); θ_i),    (1)

where f(H_0, ···, H_{2^N − 1}) is a logical (i.e. deterministic) function of the hidden variables {H_i}, and δ_{y,f} = 1 if y = f and 0 otherwise.
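To make Eq. (1) concrete, the following sketch evaluates a small noisy-logical distribution by brute force. It is an illustrative implementation written for this presentation, not code from the paper: the function names, the (alpha, beta) parameterization of P(H_i | ψ_i(C)) (anticipating Section 3), and the choice of noisy-or as the example are our own.

    import itertools

    def noisy_logical_prob(psi_values, alphas, betas, f):
        """Brute-force evaluation of Eq. (1): P(y=1 | C) for a noisy-logical
        distribution.  psi_values[i] is the causal feature psi_i(C) in {0,1},
        alphas[i] = P(H_i=1 | psi_i(C)=1), betas[i] = P(H_i=1 | psi_i(C)=0),
        and f maps a tuple of hidden states (H_0, ..., H_{n-1}) to {0,1}."""
        n = len(psi_values)
        total = 0.0
        for hidden in itertools.product([0, 1], repeat=n):   # sum over H
            prob = 1.0
            for h, psi, a, b in zip(hidden, psi_values, alphas, betas):
                p_on = a if psi == 1 else b                   # P(H_i=1 | psi_i(C))
                prob *= p_on if h == 1 else (1.0 - p_on)
            if f(hidden) == 1:                                # delta_{y=1, f(H)}
                total += prob
        return total

    # Noisy-or: y = H_1 v H_2 with causal features psi_1(C)=C_1, psi_2(C)=C_2.
    noisy_or = lambda H: int(H[0] or H[1])
    print(noisy_logical_prob([1, 1], alphas=[0.8, 0.6], betas=[0.0, 0.0],
                             f=noisy_or))   # 1 - (1-0.8)*(1-0.6) = 0.92

A noisy-and-not unit is obtained by replacing f with lambda H: int(H[0] and not H[1]), matching the special cases discussed next.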
It was shown (Yuille & Lu, 2008) that any conditional distribution on binary variables can be expressed in noisy-logical form. In particular, the standard noisy-or and noisy-and-not distributions (Pearl, 1988) can be obtained as special cases by setting C = (C_1, C_2), using causal features ψ_1(C) = C_1 and ψ_2(C) = C_2, and letting y = H_1 ∨ H_2 to obtain noisy-or and y = H_1 ∧ ¬H_2 to obtain noisy-and-not. In these cases, the noisy-logical distribution takes a simple form which gives insight into the structure of the data, in particular by showing the logical relationship between the output y and the hidden states H_1, H_2.

More generally, we can express this relationship in Disjunctive Normal Form (DNF), for example y = (H_1 ∧ H_2 ∧ ···) ∨ (H_3 ∧ H_4 ∧ ···) ∨ ···, see Fig. 1 (right panel). We can think of this DNF as specifying a logical process underlying the data which is made noisy by the distributions P(H_i | ψ_i, θ_i). In other words, there is a deterministic rule for the data specified by the {H_i}, but we cannot observe the {H_i} directly and instead must infer them from the noisy observations {ψ_i(C)}.

Figure 1. The basic noisy-logical distribution (left panel). The DNF formulation of the noisy-logical distribution (right panel).

By contrast, additive logistic regression (Friedman et al., 2000) is a standard way to represent distributions. We change notation so that the output variable is y ∈ {−1, +1} and the input is x. This conditional distribution can be expressed in the form:

P(y|x) = (1 / Z[x, {λ_i}]) exp{ y Σ_i λ_i φ_i(x) },    (2)

where the {φ_i(x)} are statistics, the {λ_i} are parameters, and Z[x, {λ_i}] is the normalization term.

The exponential distributions in Eq. (2) can be learnt from training data {(x_µ, y_µ) : µ = 1, ···, N} (y_µ ∈ {±1}) by maximum likelihood (ML), which finds λ* = arg max_λ ∏_{µ=1}^{N} P(y_µ|x_µ; λ). This has several limitations: (i) performing ML is difficult because it requires evaluating the normalization term Z[x, {λ_i}]; (ii) ML learning may not be optimal for learning a decision rule when only a finite number of training samples are available.

AdaBoost (Freund & Schapire, 1996) learns a strong classifier H(x) = sign{ Σ_{i=1}^{M} λ_i φ_i(x) } from training data without needing to evaluate the normalization term Z[x, {λ_i}], using a set of weak classifiers {φ_i(x)} (φ_i(x) ∈ {±1}). It classifies new data x as positive if H(x) ≥ 0 or as negative if H(x) < 0. It can be formulated as coordinate descent on the function F(λ) = Σ_µ exp{ −y_µ Σ_i λ_i φ_i(x_µ) }, where the {λ_i} are the components of λ. It can be shown (Friedman et al., 2000) that AdaBoost converges asymptotically to a strong classifier H(x) = sign{ Σ_i λ_i* φ_i(x) } whose coefficients {λ_i*} are the maximum likelihood (ML) estimators λ* = arg max_λ P({y_µ}|{x_µ}, λ) of the additive logistic regression model. Hence the strong classifier corresponds to the log-likelihood ratio test sign{ log [P(y = 1|x, λ*) / P(y = −1|x, λ*)] }.
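As a reference point for the comparison that follows, here is a minimal sketch of the coordinate-descent view of AdaBoost on F(λ): at each round the weak classifier with the lowest weighted error is selected and its weight is given by the usual closed-form update. The variable names and the simple re-weighting loop are our own illustration under these assumptions, not code from the paper.

    import numpy as np

    def adaboost(X, y, weak_classifiers, rounds=50):
        """Coordinate descent on F(lambda) = sum_mu exp(-y_mu * sum_i lambda_i phi_i(x_mu)).
        X: (n, d) data, y: labels in {-1, +1},
        weak_classifiers: list of functions phi(X) -> array of {-1, +1}."""
        n = len(y)
        w = np.ones(n) / n                       # exponential weights on examples
        ensemble = []                            # selected (lambda_i, phi_i) pairs
        for _ in range(rounds):
            # pick the weak classifier with the smallest weighted error
            errors = [np.sum(w * (phi(X) != y)) for phi in weak_classifiers]
            i = int(np.argmin(errors))
            err = max(errors[i], 1e-12)
            if err >= 0.5:
                break
            lam = 0.5 * np.log((1.0 - err) / err)    # closed-form coordinate update
            ensemble.append((lam, weak_classifiers[i]))
            w *= np.exp(-lam * y * weak_classifiers[i](X))
            w /= w.sum()
        return ensemble

    def strong_classifier(ensemble, X):
        """H(x) = sign(sum_i lambda_i phi_i(x))."""
        scores = sum(lam * phi(X) for lam, phi in ensemble)
        return np.where(scores >= 0, 1, -1)

The point of contrast below is that AdaBoost's per-round parameter comes from an analytic expression, whereas CNLL will require one-dimensional searches.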
3. Compositional Noisy-Logical Learning (CNLL)

We now develop Compositional Noisy-Logical Learning (CNLL) as an alternative learning strategy to AdaBoost, based on the noisy-logical distribution. We supply two CNLL algorithms in sections (3.1, 3.3). They both proceed by minimizing a performance measure in a greedy manner (similar to AdaBoost).

We are given a set of training examples {(x_µ, y_µ) : µ = 1, ···, N}, where x_µ is the data vector and y_µ ∈ {0, 1} is the label. We specify a set of causal features {ψ_ω(x) : ω ∈ Ω} with ψ_ω(x) ∈ {0, 1}, where Ω indexes all the possible causal features. We will specify the precise form of these causal features in the experimental section (4). Our default is a set of stump features {f_a(x)} from which we determine a causal feature (weak classifier): ψ_{(a,b,c)}(x) = 1 if p_b log[P(y = 1|f_a(x)) / P(y = 0|f_a(x))] > T_c, where p_b ∈ {±1} is a parity and T_c is a threshold, and ψ_{(a,b,c)}(x) = 0 otherwise. We also consider logical ANDs and ORs of these causal features (e.g. ψ_{(a,b,c)}(x) ∧ ψ_{(a',b',c')}(x) and ψ_{(a,b,c)}(x) ∨ ψ_{(a',b',c')}(x)). To simplify notation, we write all causal features as ψ(x).

Both our CNLL algorithms construct a noisy-logical distribution using the causal features as input. In this paper we use a variant of the noisy-logical form presented in (Yuille & Lu, 2008). In equation (1) we set θ_i = (α_i, β_i), where α_i = P(H_i = 1|ψ_i(x) = 1; θ_i) and β_i = P(H_i = 1|ψ_i(x) = 0; θ_i) (β_i = 0 for all i in (Yuille & Lu, 2008)).

The performance measure is the error E_R of the log-likelihood classifier R(x) = sign{ log [P(y = 1|x) / P(y = 0|x)] }. The false positive errors are weighted more than the false negative errors to allow for the different numbers of positive and negative examples in the datasets.

3.1. Basic CNLL: The First CNLL Algorithm

CNLL grows the distribution P_t(y|x) by composition, see Fig. 2. This involves adding a new hidden unit H_{t+1}, a causal feature ψ(x), a distribution P(H_{t+1}|ψ(x)) parameterized by θ_{t+1}, and a compositional rule to obtain f_{t+1}({H_1, ···, H_{t+1}}) by combining f_t({H_1, ···, H_t}) with H_{t+1}. We grow the distribution P_t(y|x) in a greedy manner by coordinate descent on the performance measure, hence selecting the causal feature and logical combination rule that give the maximal increase in performance.

Figure 2. CNLL grows a noisy-logical distribution.

We initialize the distribution P_{t=0}(y|x) by setting f_0({H_i}) = H_1 ∨ H_2 ∨ ··· and α_i = β_i = 0 for all i, which corresponds to P_0(y = 1|x) = 0 for all x. (An alternative initialization is f_0({H_i}) = H_1 ∧ H_2 ∧ ··· with α_i = β_i = 1 for all i, which corresponds to P_0(y = 1|x) = 1 for all x.)

The first CNLL algorithm has pseudo-code given in Fig. 3. It uses the composition rule f_{t+1}({H_1, ···, H_{t+1}}) = f_t({H_1, ···, H_t}) ⊕ H_{t+1}, where ⊕ = ∧ or ⊕ = ∨ (i.e. logical AND or OR). This composition generates a new distribution P_{t,(ψ,⊕,θ)}(y|x), which defines a classifier R_{t,(ψ,⊕,θ)}(x). To select the best extension of P_t(y|x), we solve for

(ψ*, ⊕*, θ*) = arg min_{(ψ,⊕,θ)} E_R(R_{t,(ψ,⊕,θ)}).

This is obtained by an exhaustive search over ψ and ⊕ and, for fixed (ψ, ⊕), by minimizing E_R(R_{t,(ψ,⊕,θ)}) with respect to θ = (α, β). The search is simplified by the decomposition E_R = E_{R,1}(ψ, α) + E_{R,0}(ψ, β), where E_{R,1}(ψ, α) = Σ_{µ: ψ(x_µ)=1} {1 − δ_{R(x_µ), y_µ}} and E_{R,0}(ψ, β) = Σ_{µ: ψ(x_µ)=0} {1 − δ_{R(x_µ), y_µ}}, with R the candidate classifier R_{t,(ψ,⊕,θ)}. It is straightforward to verify that E_{R,1}(ψ, α) and E_{R,0}(ψ, β) are functions of α and β, respectively. Hence we can determine α_{t+1} ∈ [0, 1] and β_{t+1} ∈ [0, 1] by two independent one-dimensional searches. Then we set P_{t+1}(y|x) = P_{t,(ψ*,⊕*,θ*)}(y|x).

This requires more computation than AdaBoost. AdaBoost must also do an exhaustive search over all weak classifiers, but can then determine the parameters by an analytic expression. By contrast, CNLL must do one-dimensional searches for α and β within the range [0, 1] and check the two possibilities (AND or OR) for ⊕.

If ⊕ = ∧, then f_{t+1}({H_i}) = f_t({H_i}) ∧ H_{t+1} and P_{t+1}(y = 1|x) = P_t(f_t = 1|x) × (α_{t+1} ψ_{t+1}(x) + β_{t+1}(1 − ψ_{t+1}(x))). If ⊕ = ∨, then f_{t+1}({H_i}) = f_t({H_i}) ∨ H_{t+1} and P_{t+1}(y = 0|x) = P_t(f_t = 0|x) × {1 − α_{t+1} ψ_{t+1}(x) − β_{t+1}(1 − ψ_{t+1}(x))}.
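To illustrate the greedy step of the first algorithm, the sketch below evaluates one candidate extension: for a fixed causal feature and a fixed choice of AND or OR it applies the update formulas above and scores the resulting classifier, with α and β chosen by a simple grid search standing in for the one-dimensional searches. The grid resolution, the uniform error weighting, and all function names are our own simplifications rather than details from the paper.

    import numpy as np

    def candidate_error(p_t, psi, y, alpha, beta, op):
        """Score one candidate extension of the first CNLL algorithm.
        p_t: current P_t(f_t = 1 | x) for each example, psi: causal feature values
        in {0,1}, y: labels in {0,1}, op: 'and' or 'or'."""
        p_h = alpha * psi + beta * (1 - psi)          # P(H_{t+1} = 1 | psi(x))
        if op == 'and':                               # y = f_t AND H_{t+1}
            p_next = p_t * p_h
        else:                                         # y = f_t OR H_{t+1}
            p_next = 1.0 - (1.0 - p_t) * (1.0 - p_h)
        pred = (p_next > 0.5).astype(int)             # log-likelihood classifier R(x)
        return np.mean(pred != y), p_next

    def best_extension(p_t, psi, y, grid=np.linspace(0.0, 1.0, 101)):
        """Grid search over (alpha, beta, op) for a single causal feature psi.
        The outer loop of the algorithm repeats this over all causal features."""
        best = (np.inf, None)
        for op in ('and', 'or'):
            for alpha in grid:
                for beta in grid:
                    err, p_next = candidate_error(p_t, psi, y, alpha, beta, op)
                    if err < best[0]:
                        best = (err, (alpha, beta, op, p_next))
        return best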
Input: Training set {(x_µ, y_µ) : µ = 1, ···, N} and a set of causal features {ψ_ω(x) : ω ∈ Ω}, where x is the data vector, y ∈ {0, 1} is the label, and Ω indexes all the possible features.
Initialize α_i = 0 and β_i = 0 for all i.
For t = 1, ···, T:
· Set MinError = 1;
· For each ⊕ ∈ {∧, ∨}
  - For each ψ
    1. find α' = arg min_α E_{R,1}(ψ, α) by a 1-D search over [0, 1]
    2. find β' = arg min_β E_{R,0}(ψ, β) by a 1-D search over [0, 1]
    3. Error = [E_{R,1}(ψ, α') + E_{R,0}(ψ, β')] / N
    4. if Error < MinError, then set (ψ*, ⊕*, α*, β*) ← (ψ, ⊕, α', β') and set MinError = Error
  - End for
· End for
· Update f_{t+1}({H_i}) = f_t({H_i}) ⊕* H_{t+1} and the corresponding distribution P_{t+1}(y = 1|x).
End algorithm
Output: Logical expression f_T({H_i}) and parameterized distribution P(y = 1|x; θ_T, f_T).

Figure 3. Pseudo-code for the first CNLL algorithm.

3.2. Converting to DNF

The final form of the noisy-logical distribution depends on the logical function y = f({H_i}) relating the hidden states. We can express this in Disjunctive Normal Form (DNF) as y = CL_1 ∨ CL_2 ∨ ··· ∨ CL_N, where each clause is a conjunction of the {H_i}, see Fig. 1 (right panel). Expressing the hidden states in DNF gives insight into the structure of the data. We can think of the data as being generated by a (deterministic) logical process specified by the DNF, but where the {H_i} are not observable and must be inferred from the noisy measurements {ψ_i(·)}.

The following iterative procedure can be used to transform the output of the CNLL algorithm, y = f({H_i}) = (···((H_1 ⊕ H_2) ⊕ H_3) ⊕ ···), into DNF. We initialize at t = 1 by setting y = H_1, which is clearly in DNF. Now suppose we have successfully converted the expression y = (···((H_1 ⊕ H_2) ⊕ H_3) ⊕ ···) ⊕ H_t to DNF y = CL_1 ∨ CL_2 ∨ ··· ∨ CL_N, where the CL_i's are conjunctions of H_i's. For the (t+1)-th iteration, we have y = ((···((H_1 ⊕ H_2) ⊕ H_3) ⊕ ···) ⊕ H_t) ⊕ H_{t+1} = (CL_1 ∨ CL_2 ∨ ··· ∨ CL_N) ⊕ H_{t+1}. If ⊕ = ∨, then y = CL_1 ∨ CL_2 ∨ ··· ∨ CL_N ∨ H_{t+1} is in DNF form with CL_{N+1} = H_{t+1}, i.e. we just add a new clause. If ⊕ = ∧, then y = (CL_1 ∨ CL_2 ∨ ··· ∨ CL_N) ∧ H_{t+1} = (CL_1 ∧ H_{t+1}) ∨ (CL_2 ∧ H_{t+1}) ∨ ··· ∨ (CL_N ∧ H_{t+1}) is in DNF form with CL_i' = CL_i ∧ H_{t+1}, i.e. we modify each clause by adding a conjunctive term.

3.3. The Second CNLL Algorithm

The second CNLL algorithm represents the logical form as DNF. At each iteration it combines f_t({H_1, ···, H_t}) with H_{t+1} by allowing H_{t+1} to be AND-ed with any one of the clauses CL_1, ···, CL_N in f_t({H_1, ···, H_t}), or AND-ed with all of the clauses, or alternatively to create a new clause. The pseudo-code for this second CNLL algorithm is given in Fig. 4. This algorithm allows a greater number of ways to construct the logical form f_{t+1}({H_1, ···, H_{t+1}}) from f_t({H_1, ···, H_t}) and hence is a more powerful algorithm, but it is slightly slower since it has to evaluate more possibilities.

Input: Training set {(x_µ, y_µ) : µ = 1, ···, N} and a set of causal features {ψ_ω(x) : ω ∈ Ω}, where x is the data vector, y ∈ {0, 1} is the label, and Ω indexes all the possible features.
Goal: Learn the DNF expression f({H_i}) = CL_1 ∨ ··· ∨ CL_n, where each CL_i is a conjunction of a subset of {H_i}.
Initialize: n = 1, CL_1 = 1, and α_i = 0, β_i = 0 for all i.
For t = 1, ···, T:
// Trying to add H_t to an existing clause
· Set MinErrC = 1; // the smallest error when adding to an existing clause
· For i = 1, ···, n
  - Let f = CL_1 ∨ ··· ∨ CL_i' ∨ ··· ∨ CL_n, with CL_i' = CL_i ∧ H_t
  - Search for (ψ, α, β) and the resulting Error as in Fig. 3;
  - if Error < MinErrC, then set (ψ_C, α_C, β_C, C) ← (ψ, α, β, i) and set MinErrC = Error
· End for
// Now the parameters are stored in (ψ_C, α_C, β_C, C)
// Trying to add H_t to every clause
· Let f = (CL_1 ∧ H_t) ∨ ··· ∨ (CL_n ∧ H_t).
· Search (ψ, α, β) to minimize the training error; denote the optimal parameters as (ψ_A, α_A, β_A) and the minimal error as MinErrAll.
// Trying to add a new clause H_t
· Let f = CL_1 ∨ ··· ∨ CL_n ∨ H_t (CL_1 = 0 if t = 1).
· Search (ψ, α, β) to minimize the training error; denote the optimal parameters as (ψ_N, α_N, β_N) and the minimal error as MinErrN.
// Updating the clauses and the DNF expression
· Pick the smallest value from MinErrC, MinErrAll, and MinErrN, update each clause CL_i accordingly, and update the clause number n if necessary; denote the new DNF expression as f_t({H_i}).
· Update the distribution P_t(y = 1|x) according to the form of f_t({H_i}).
End algorithm
Output: DNF expression f_T({H_i}) and parameterized distribution P(y = 1|x; θ_T, f_T).

Figure 4. The second CNLL algorithm; "//" stands for comments.
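The clause bookkeeping used in Section 3.2 and by the second algorithm can be kept explicit with a small data structure: a DNF is a list of clauses and each clause is a set of hidden-unit indices. The sketch below is our own illustration with hypothetical helper names; it implements the three update moves and the ∧/∨ conversion rule.

    def or_new_clause(dnf, h):
        """DNF update for '⊕ = OR': append H_h as a new clause."""
        return dnf + [frozenset([h])]

    def and_with_all(dnf, h):
        """DNF update for '⊕ = AND': distribute H_h into every clause,
        (CL_1 v ... v CL_n) ^ H_h = (CL_1 ^ H_h) v ... v (CL_n ^ H_h)."""
        return [clause | {h} for clause in dnf]

    def and_with_clause(dnf, h, i):
        """Second-algorithm move: AND H_h into the single clause CL_i."""
        return [clause | {h} if j == i else clause for j, clause in enumerate(dnf)]

    def evaluate_dnf(dnf, hidden):
        """Evaluate y = CL_1 v ... v CL_n on a dict of hidden states {index: 0/1}."""
        return int(any(all(hidden[idx] for idx in clause) for clause in dnf))

    # Example: start from y = H1, build ((H1 ^ H2) v H3) in DNF form.
    dnf = [frozenset([1])]
    dnf = and_with_all(dnf, 2)      # [(H1 ^ H2)]
    dnf = or_new_clause(dnf, 3)     # [(H1 ^ H2), (H3)]
    print(evaluate_dnf(dnf, {1: 1, 2: 0, 3: 1}))   # 1, since the clause H3 is satisfied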
4. Experimental Results

We evaluate CNLL in two ways. Firstly, we use data generated from noisy-logical processes, e.g. by logical rules such as exclusive-or (XOR) with some additional noise. This is to verify that CNLL gives the correct results for the class of problems to which it is well suited. Secondly, we compare it to AdaBoost on standard datasets. In both cases, we evaluate the algorithm in terms of classification performance and knowledge extraction.

4.1. Results on Noisy-Logical Data

Noisy-logical data means that the process generating the data is basically logical but with noise added. Such data can be obtained either by starting with a logical rule and adding noise (our first example) or by sampling from the discriminative model. For noisy-logical data we expect CNLL to give better performance than AdaBoost, since the data will be described by a sparse noisy-logical distribution but not by a sparse exponential model. We anticipate that CNLL will learn the correct underlying logical process of the data, i.e., determine the correct DNF.

We first consider a simple variant of the classic XOR problem. We generate 3,000 points (x_1, x_2) ∈ [0, 2] × [0, 2] in the following manner: (i) sample 1,000 points uniformly in [0, 1] × [0, 1], (ii) sample 500 points uniformly in [1, 2] × [1, 2], (iii) sample 400 points uniformly in [1, 2] × [0, 1], and (iv) sample 1,100 points uniformly in [0, 1] × [1, 2]. We label a data point (x_1, x_2) as a positive example if (x_1 ∈ [0, 1] ∧ x_2 ∈ [0, 1]) ∨ (x_1 ∈ [1, 2] ∧ x_2 ∈ [1, 2]) and as a negative example otherwise. The weak classifiers are defined by thresholding the x_1 and x_2 coordinates, i.e., the set of rules of the form x_i > T_1 or x_i < T_2 for i = 1, 2, where T_1, T_2 are thresholds.

We use the second CNLL algorithm to learn the underlying logical expression. Table 1 shows the selected features, the clause structure, and the error rates after each new feature is added. CNLL learns the logical expression E = (x_2 < 1 ∧ x_1 < 1) ∨ (x_1 > 1 ∧ x_2 > 1), which is the correct solution to the XOR problem. (Note: the first CNLL algorithm gets stuck at a saddle point of the error measure and lacks the logical moves needed to descend further; it learns (H_1 ∨ H_2) ∧ H_3 but is unable to move to (H_1 ∧ H_2) ∨ (H_3 ∧ H_4).)

Table 1. The features selected for the XOR problem.

Feature    Clause    Error
x_2 < 1    1         0.300
x_1 > 1    2         0.134
x_2 > 1    2         0.134
x_1 < 1    1         0.000
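A data-generation sketch for this XOR variant is given below; the sampling counts and the labelling rule follow the description above, while the function name and the use of numpy are our own choices.

    import numpy as np

    def make_xor_data(rng=np.random.default_rng(0)):
        """Generate the XOR-style dataset described above: 3,000 points in
        [0,2] x [0,2], positive iff both coordinates lie in the same unit square."""
        blocks = [((0, 0), 1000), ((1, 1), 500), ((1, 0), 400), ((0, 1), 1100)]
        points = []
        for (ox, oy), count in blocks:
            pts = rng.uniform(0.0, 1.0, size=(count, 2)) + np.array([ox, oy])
            points.append(pts)
        X = np.vstack(points)
        # positive if (x1 in [0,1] and x2 in [0,1]) or (x1 in [1,2] and x2 in [1,2])
        y = (((X[:, 0] <= 1) & (X[:, 1] <= 1)) |
             ((X[:, 0] > 1) & (X[:, 1] > 1))).astype(int)
        return X, y

    X, y = make_xor_data()
    print(X.shape, y.mean())   # (3000, 2) and the positive fraction 1500/3000 = 0.5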
We now test CNLL on more noisy-logical data, obtained by sampling from the discriminative distribution P(y|{ψ_i}). We specify an underlying logical process y = (H_1 ∧ H_2) ∨ (H_3 ∧ H_4) for the data, with added noise (described below). We compare the classification performance of CNLL and AdaBoost on this data and check whether CNLL discovers the correct noisy-logical structure.

We generate the noisy-logical data by sampling from the distribution P(y|{ψ_i}), exploiting the structure shown in Fig. 1. We first sample the {CL_i} from P({CL_i}|y) = P(y|{CL_i}) P({CL_i}) / P(y), next we sample the {H_i} from P({H_i}|{CL_i}), and finally we sample the {ψ_i} from P({ψ_i}|{H_i}) = P({H_i}|{ψ_i}) P({ψ_i}) / P({H_i}). More precisely, we set P(CL_1 = 1, CL_2 = 0|y = 1) = P(CL_1 = 0, CL_2 = 1|y = 1) = 0.3 and P(CL_1 = 1, CL_2 = 1|y = 1) = 0.4; when y = 0, we must have CL_1 = 0 and CL_2 = 0. We also specify P(H_i = 1, H_j = 0|CL_k = 0) = P(H_i = 0, H_j = 1|CL_k = 0) = 0.4 and P(H_i = 0, H_j = 0|CL_k = 0) = 0.2, where CL_k is one of the two clauses and H_i, H_j are the two components of the corresponding clause; obviously, when CL_k = 1, both components must be 1. Finally, we specify P(ψ_i = 1|H_i = 1) and P(ψ_i = 0|H_i = 0) for i = 1, ···, 4 (observe that when these two probabilities are set to 1, the data is generated by a deterministic logical expression). The components ψ_5, ···, ψ_50 are unrelated to the logical expression and are sampled uniformly at random from {0, 1}. We generated 1,000 examples for each of y = 0 and y = 1.

We used the resulting data as input to the second CNLL algorithm and to AdaBoost, to compare their performance on this noisy-XOR data. Again we expect CNLL to outperform AdaBoost, since the data can be sparsely described by a noisy-logical distribution but not by an exponential model. We consider two cases: (I) P(ψ_i = 1|H_i = 1) = 0.95 and P(ψ_i = 0|H_i = 0) = 0.95; (II) P(ψ_i = 1|H_i = 1) = 1 and P(ψ_i = 0|H_i = 0) = 1.

CNLL performs well in both cases and obtains the correct DNF logical expression. It selects the correct causal cues ψ_1, ···, ψ_4 and ignores the rest. For case (I), CNLL has training and testing error rates of 5% and 7%, respectively. For case (II), CNLL has 0% error rate in both the training and testing stages.

AdaBoost performs comparatively poorly. For case (I) it has training and testing errors of 21% and 24%, respectively. For case (II) its training and testing errors are 16% and 20%, respectively. In both cases, the AdaBoost algorithm selects 20 features. In the first few iterations, AdaBoost selects the correct causal cues ψ_1, ···, ψ_4, but at later iterations AdaBoost also selects some of the irrelevant features ψ_5, ···, ψ_50. Overall, the performance of AdaBoost is poor at both classification and knowledge extraction.
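For reference, this generative process can be sampled directly from the conditional probabilities listed above. The sketch below is our own illustration (the helper names, numpy usage, and per-example random seeds are not from the paper).

    import numpy as np

    def sample_example(y, p_obs=0.95, n_noise=46, rng=np.random.default_rng(0)):
        """Draw one example from the noisy-logical process y = (H1^H2) v (H3^H4).
        p_obs = P(psi_i=1|H_i=1) = P(psi_i=0|H_i=0); n_noise irrelevant features."""
        # Clause states given the label.
        if y == 1:
            cl = [(1, 0), (0, 1), (1, 1)][rng.choice(3, p=[0.3, 0.3, 0.4])]
        else:
            cl = (0, 0)
        # Hidden states within each clause: both 1 if the clause is on, otherwise
        # (1,0), (0,1), (0,0) with probabilities 0.4, 0.4, 0.2.
        H = []
        for c in cl:
            if c == 1:
                H.extend([1, 1])
            else:
                H.extend([(1, 0), (0, 1), (0, 0)][rng.choice(3, p=[0.4, 0.4, 0.2])])
        # Observed causal features: psi_i agrees with H_i with probability p_obs.
        psi = [h if rng.random() < p_obs else 1 - h for h in H]
        # Irrelevant features psi_5 ... psi_50, uniform over {0, 1}.
        psi.extend(rng.integers(0, 2, size=n_noise).tolist())
        return np.array(psi), y

    data = [sample_example(y, rng=np.random.default_rng(i))
            for i, y in enumerate([0, 1] * 1000)]   # 1,000 examples for each label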
4.2. Results on Standard Datasets and Comparison to AdaBoost

We now compare CNLL with AdaBoost on four standard datasets (breast cancer, ionosphere, splice, and ocr49) which are available from the UCI repository and for which AdaBoost results are reported. The generating processes are unknown for these datasets, so it is unclear whether exponential or noisy-logical models would give better fits. Table 2 shows the size and the number of attributes of the datasets.

Table 2. The sizes of the training and testing datasets for each of ten trials, and the number of attributes. The trial data samples are randomly selected from the datasets in the UCI repository.

             cancer    iono    ocr49    splice
Training     630       315     1000     1000
Testing      69        36      5000     2175
Attributes   10        34      64       60

We compare our results against those of AdaBoost using the stump features reported in (Reyzin & Schapire, 2006). We use similar stump features for CNLL. Table 3 shows the results of CNLL and AdaBoost, and Fig. 5 shows the average training and testing error curves of CNLL. Our results show that we obtain significantly better results than AdaBoost on all four datasets. Moreover, CNLL outputs simpler representations using far fewer causal features. We typically obtain good results with around 10 features, while AdaBoost typically uses O(100).

Table 3. The testing errors of CNLL and AdaBoost on the four datasets (averaged over 10 trials for each dataset). Observe that CNLL gives better results than AdaBoost in all cases. AdaBoost was run for 500 rounds on all trials, hence using roughly 500 weak classifiers. CNLL required far fewer weak classifiers in general: on average 9 for cancer, 5 for iono, 15 for ocr49, and 15 for splice.

            cancer    iono     ocr49    splice
AdaBoost    4.29%     9.58%    6.28%    6.79%
CNLL        1.74%     6.11%    5.83%    4.70%

CNLL shows a greater tendency to overfit the training data than AdaBoost, see Fig. 5. While obtaining comparable results to AdaBoost on the test set, we obtain significantly better results on the training set (sometimes by a factor of three). This phenomenon may be partly explained by the fact that CNLL gives a richer class of decision boundaries than AdaBoost (this can be seen by examining the form of log[P(y = 1|x) / P(y = 0|x)] for noisy-logical distributions). Another reason might be that the CNLL algorithm is greedy and might get trapped in a local minimum. We hope that a better understanding of CNLL may help us take advantage of the low training errors and obtain even better results on the test set.

Figure 5. Average training and testing error curves of CNLL on the UCI datasets, plotted as error rate (percent) versus round: (a) breast cancer, (b) ionosphere, (c) OCR49, (d) splice. CNLL gives good results for five causal features.

We now attempt to use the hidden variables H and the clauses CL to extract knowledge about the structure of the data. We picked one of the trial results on the cancer dataset, where CNLL output the logical expression f({H_1, ···, H_9}) = CL_1 ∨ CL_2 ∨ CL_3 ∨ CL_4, with CL_1 = H_1 ∧ H_2 ∧ H_3 ∧ H_5 ∧ H_8 ∧ H_9, CL_2 = H_4 ∧ H_5 ∧ H_8 ∧ H_9, CL_3 = H_6 ∧ H_8 ∧ H_9, and CL_4 = H_7 ∧ H_8 ∧ H_9. Table 4 shows the performance of the clauses: the first and second rows show the true positive and error rates for each clause on its own, and the third and fourth rows show the accumulated true positive and error rates obtained by combining the clauses. For this example, we see that the first clause is clearly dominant and explains most of the error. We observe similar results on other trials for these datasets: dominance by a single AND clause. This suggests either that these datasets have limited OR structure, or that CNLL is unable to extract it.

Table 4. Clause analysis for one of the trials on the cancer dataset, see text for details.

                            CL_1      CL_2      CL_3      CL_4
Individual True Positive    0.9737    0.0310    0.2792    0.8878
Individual Error Rate       0.0206    0.6444    0.4794    0.0778
Cumulative True Positive    0.9737    0.9785    0.9809    0.9833
Cumulative Error Rate       0.0206    0.0175    0.0159    0.0143
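The bookkeeping behind a clause analysis like Table 4 can be reproduced from a learned DNF by scoring the clauses one at a time and then cumulatively. The sketch below is our own helper and assumes the hidden states have been replaced by hard 0/1 estimates and that clause members are given as 0-based column indices; neither assumption comes from the paper.

    import numpy as np

    def clause_analysis(clauses, H_est, y):
        """Individual and cumulative true-positive/error rates for DNF clauses.
        clauses: list of sets of hidden-unit column indices (0-based),
        H_est: (n_examples, n_hidden) 0/1 estimates of the hidden states,
        y: labels in {0, 1}."""
        def rates(pred):
            tp = np.mean(pred[y == 1] == 1)        # true-positive rate
            err = np.mean(pred != y)               # overall error rate
            return tp, err
        rows = []
        cumulative = np.zeros(len(y), dtype=int)
        for clause in clauses:
            fired = np.all(H_est[:, sorted(clause)] == 1, axis=1).astype(int)
            cumulative = np.maximum(cumulative, fired)   # OR with previous clauses
            rows.append(rates(fired) + rates(cumulative))
        return rows   # one (indiv TP, indiv err, cum TP, cum err) tuple per clause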
5. Conclusion

This paper conjectured that the effectiveness of AdaBoost, in terms of performance and particularly knowledge extraction, depends on whether the conditional distribution of the data can be approximated by a sparse exponential distribution. We described how conditional distributions can also be represented as noisy-logical distributions (Yuille & Lu, 2008) and conjectured that for some applications this leads to sparser representations than those obtained with exponential distributions.

We developed Compositional Noisy-Logical Learning (CNLL) as a way to learn a two-class classifier by composing elementary noisy-logical distributions. We specified two CNLL algorithms, both of which proceed by coordinate descent on a training error measure. This is slightly more complex than learning for AdaBoost: (i) it requires performing two searches for values in the range [0, 1]; (ii) the training error is not convex (unlike the measure used in AdaBoost). Nevertheless, the training time increases with respect to AdaBoost only by a constant factor (due to the searches in [0, 1]). The resulting distribution is expressed using hidden units with DNF and has a simple, transparent form.

We tested CNLL on two types of data: (i) data generated to be in noisy-logical form, and (ii) four standard datasets from the UCI repository. Both sets of results showed the effectiveness of CNLL in terms of both performance and knowledge extraction. CNLL was much more successful than AdaBoost for the noisy-logical data and was able to identify the underlying logical process. CNLL also outperformed AdaBoost on the four datasets from the UCI repository, both in terms of classification performance and in terms of the sparseness of the resulting representation (by an order of magnitude), enabling knowledge extraction.

Note that CNLL takes longer to train than AdaBoost by a constant factor, but the number of weak classifiers required by CNLL is smaller (by an order of magnitude), so the total training time is roughly comparable. Moreover, the resulting CNLL classifier is faster in testing because it relies on fewer weak classifiers. This may be an advantage for practical systems (e.g., implementing a face detector on a cell phone).

References

Alpaydin, E. (2004). Introduction to Machine Learning. MIT Press.

Cheng, P. W. (1997). From covariation to causation: A causal power theory. Psychological Review, 104, 367-405.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning, 148-156.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337-407.

Griffiths, T. L., & Tenenbaum, J. B. (2005). Structure and strength in causal induction. Cognitive Psychology, 51, 334-384.

Meyer, Y. (2001). Oscillating Patterns in Image Processing and Nonlinear Evolution Equations. University Lecture Series Vol. 22, American Mathematical Society.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Reyzin, L., & Schapire, R. (2006). How boosting the margin can also boost classifier complexity. Proceedings of the 23rd International Conference on Machine Learning, 753-760.

Yuille, A. L., & Lu, H. J.
(2008). The Noisy-Logical Distribution. Advances in Neural Information Processing Systems.

Acknowledgments

We acknowledge the support from the Air Force grant FA9550-08-1-0489. We appreciate conversations with YingNian Wu and HongJing Lu.