Learning Random Monotone DNF Under the Uniform Distribution

Linda Sellie
University of Chicago, Chicago IL
lmsellie@uchicago.edu

Abstract. We show that randomly generated monotone $c\log(n)$-DNF formulae can be learned exactly in probabilistic polynomial time. Our notion of "randomly generated" is with respect to a uniform distribution. To prove this we identify the class of well-behaved monotone $c\log(n)$-DNF formulae, show that almost every monotone DNF formula is well behaved, and show that there exists a probabilistic Turing machine that exactly learns all well-behaved monotone $c\log(n)$-DNF formulae.

$|s| = \varphi(n) = \log\log(n^{1/3})$. The medium and small smoothness properties guarantee that, with high probability, if $s$ has size at most $\varphi(n)$ and there exists a term $t \in f$ such that $s \subseteq t$, then $s$ is generated by this process. At this point the large smoothness property comes into play and guarantees that this $t$ is unique, and therefore can be found efficiently. In this way, we find all terms $t$ of $f$ in polynomial time.

1.1 Motivation and Past Work

Mentioning DNF, Valiant [12] states: "The possible importance of disjunctions of conjunctions as a knowledge representation stems from the observations that on the one hand humans appear to like using it, and, on the other, that there is circumstantial evidence that significantly larger classes may not be learnable in polynomial time."

Many learning theorists have considered learning monotone DNF formulae. Angluin [2] completely solved this problem for the case of exact learning with membership queries: all monotone DNF are learnable in polynomial time in this model, for all distributions. The problem has proven more difficult if the learner is restricted to sampling, i.e., learning by example. The obstacle seems to be "cluster structure" within the formula, specifically a relatively large set of variables common to a relatively large number of terms. Existing results in the literature tackle this obstacle in two different ways.
(1) Allow the running time of the learner to explode in the face of such clusters; e.g., Verbeurgt [13] learns any poly(n)-size DNF in time $n^{O(\log(n))}$ from uniform examples. Or (2) consider classes of formulae that do not contain such clusters, specifically by random generation and a limited number of terms; e.g., Servedio [9] learns any $2^{O(\sqrt{\log(n)})}$-term monotone DNF in polynomial time from a product distribution. Other researchers have used similar approaches for other problems, [10], [8], and [6]. The results of this paper belong to group (2). Our result is distinguished from Servedio [9] in that our definition of well behaved represents an initial attempt to formalize the obstacle, and to obtain the best possible result based on that formalization. From this, we obtain conditions of greater generality.

Despite the difficulty of learning monotone DNF with random examples drawn from the uniform distribution, the naturalness of the class suggests that, in some restricted form, it must be possible to learn. In their 1994 paper, Aizenstein and Pitt proposed learning most DNF instead of all DNF. They defined "most" as the DNF generated randomly with certain parameters set; one parameter is choosing the variables in a term with probability 1/2. They left as an open question a more natural setting of those parameters. Jackson and Servedio in 2006 started answering the open question of Aizenstein and Pitt in their paper [7]. They learned "most" monotone DNF where the number of terms is bounded by $O(n^{2-\epsilon})$, with fixed term size $\log m$, where $m$ is the number of terms. We continue this work left open by Aizenstein and Pitt, and Jackson and Servedio. We expand the approach used by Jackson and Servedio in their paper [7]. To learn random monotone DNF with $O(n^{2-\epsilon})$ terms, they use a clustering algorithm after an inclusion/exclusion pair-finding algorithm. In our paper, we learn $O(n^c)$ terms in polynomial time for any constant $c$, with fixed term size $c\log(n)$. Similar results were independently obtained by Jackson, Lee, Servedio and Wan [5], but are slightly weaker; they use a similar algorithm but significantly different underlying proofs.

Theorem 1. Given a random monotone DNF $f$, Algorithm Learn Random Monotone DNF finds $f$ in polynomial time with high probability.

1 Introduction

Intuitively, a monotone $c\log(n)$-DNF $f$ is well behaved if it satisfies three smoothness criteria, the "small," "medium," and "large" $z$ properties, which collectively rule out an unexpectedly large number of terms having a common subset of variables. Thus, by removing terms we maintain our well-behaved criteria, and we have:

Theorem (Subset property of the set of well-behaved functions). If $f$ is well behaved and $f'$ contains a subset of the terms of $f$, then $f'$ is also well behaved.

The question of what is meant by "a randomly generated monotone $c\log(n)$-DNF formula" is somewhat application specific, but because of the subset property of the set of well-behaved functions, our learning algorithm and its proof of correctness are quite robust. We imagine a process that randomly selects $m$ terms of size $c\log(n)$; we show that such a function will be well behaved with high probability as long as $m \le 2c\log\log(n)\,n^c$ (where roughly a $1/\log^{2c}(n)$ fraction of the examples will be false when $m = 2c\log\log(n)\,n^c$). This subsumes standard notions of randomness that are intended to generate formulae which are expected to be true with fixed probability less than one.

For functions with the small and medium smoothness properties, and for a set of variables $s$ of bounded size, we can efficiently determine by sampling, with high probability, whether or not there exists a term $t \in f$ such that $s \subseteq t$. Our algorithm considers all subsets of variables $s$ of a given, fixed size. To extend $s$, we make multiple trials of random extensions of $s$.
1.2 Our Model and Random Functions

over $n$ variables, with terms of size $k$, and $m$ terms. Our interest is in the case $m \le 2^{k+1}c\log\log(n)$, where $c = k/\log(n)$. Each term is selected independently and uniformly from the set of all $k$-variable terms.

2 Notation and Definitions

Our function will be defined on $n$ variables; we let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of variables. For $s \subseteq X$ we define $X\setminus s = \{x\in X \mid x\notin s\}$. Let $t\subseteq X$ be a term, and let $k = |t| = c\log(n)$ be the size of a term. Let $m$ be the number of terms, where $m \le m_{\max} = 2^{k+1}c\log\log(n) = 2c\log\log(n)\,n^c$. Let $f = \bigvee_{i=1,\ldots,m} t_i$. We define $f\setminus t = \{t'\in f \mid t'\ne t\}$.

Let $E = \{0,1\}^n$ be the set of all examples, and let $E^+ = \{e\in E \mid f(e)\}$ be the set of all positive examples. Let $s\subseteq X$, and let $a_s$ be a partial assignment to the variables in $s$. For $x\in s$ and $a_s$ a partial assignment, by an abuse of notation we define $x(a_s) = 1$ iff the assignment to $x$ is 1, and 0 otherwise. Let $X_{a_s} = \{x\in s \mid x(a_s)\}$ be the set of variables in $s$ that $a_s$ satisfies.

We use $a_s$ to partition the set of examples $E^+$ and the set of terms in $f$. We then explore the relationship between these sets with an inclusion/exclusion algorithm that allows us to find subsets of terms of $f$. Let $E_{a_s} = \{e\in E \mid \forall x\in s,\ x(a_s) = x(e)\}$ be the set of examples that agree with $a_s$ on all the variables of $s$. Let $E^+_{a_s} = E_{a_s}\cap E^+$ be the set of positive examples that agree with $a_s$ on all the variables of $s$. Let $T_{a_s} = \{t\in f \mid t\cap s = X_{a_s}\}$. Let $T_e = \{t\in f \mid t(e)\}$ be the set of terms satisfying an example $e$. Let $\#_0(a_s) = |\{x\in s \mid x(a_s)=0\}|$; e.g., $\#_0(10110) = 2$.

Let $\varphi(n) = \log\log(n^{1/3})$. We use $\varphi(n)$ throughout the paper as a bound that is not constant, and not too large. We chose this value of $\varphi(n)$ for technical ease; we could have chosen other values for $\varphi(n)$, such as $\varphi(n) = \log\log(n)$.

For ease of notation, we define $i^{\underline{j}} = \frac{i!}{(i-j)!}$. For $e$ and $a_s$, we define a transformation $e' = e^{s\leftarrow a_s}$ by: $\forall x\notin s$, $x(e') = x(e)$, and $\forall x\in s$, $x(e') = x(a_s)$.
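As a concrete illustration of the model $F^{n,k,m}$, drawing $m$ terms independently and uniformly from the $k$-variable terms can be sketched as follows. This is a minimal sketch; the function names are ours, not the paper's.

```python
import random

def random_monotone_dnf(n, k, m, rng=None):
    """Draw m terms independently and uniformly from all k-variable
    monotone terms over x_0..x_{n-1} (the model F^{n,k,m})."""
    rng = rng or random.Random(0)
    return [frozenset(rng.sample(range(n), k)) for _ in range(m)]

def evaluate(dnf, example):
    """A monotone term is satisfied iff all of its variables are 1;
    the DNF is satisfied iff some term is."""
    return any(all(example[i] for i in t) for t in dnf)

# Small sanity check with n = 8 variables, term size k = 3, m = 4 terms.
f = random_monotone_dnf(8, 3, 4)
all_ones = [1] * 8
all_zeros = [0] * 8
```

Since the function is monotone and every term is nonempty, the all-ones example is always positive and the all-zeros example is always negative.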
For $e$ and $x'$ we define $e' = e^{flip(x')}$ by: $\forall x\in X\setminus\{x'\}$, $x(e') = x(e)$, and $x'(e') = 1 - x'(e)$.

Our algorithm discovers $f$ by finding subsets of terms in $f$. We use the fact that a set of variables $s$ is a subset of a term iff there is a positive example $e$ which becomes false under $e^{flip(x)}$ for every $x\in s$.

Definition 2. $e$ is $s$-minimal for $f$ iff
· $e\in E^+$, and
· $\forall x\in s$, $e^{x\leftarrow 0}\notin E^+$.

We define the set of $s$-minimal examples:

Definition 3. Let $\Delta_s = \{e\in E^+ \mid e$ is $s$-minimal$\}$.

Thus, given any $e\in\Delta_s$ and $t\in f$, if $t(e) = 1$ then $s\subseteq t$. The idea behind our proof is that we can determine whether $\Delta_s$ is non-empty for any $s$ of cardinality greater than $c+1$ and less than or equal to $\varphi(n)+1$, thus finding a subset of a term.

Continuing the work of Aizenstein and Pitt [1] and Jackson and Servedio [7], we explore learning a function chosen randomly from a large class of functions. Jackson and Servedio learn a monotone DNF formula chosen randomly from a subclass of monotone DNF; we do the same, except we choose a larger subclass of monotone DNF. As in Jackson and Servedio, we randomly choose the terms for our function from the $\binom{n}{k}$ possible terms of size $k$. We differ from Jackson and Servedio's choice of a class of functions in two ways. The most important is that we learn functions with $n^c$ terms for any $c$, while they learn only for $c \le 2-\epsilon$ for $\epsilon > 0$. The second way we differ is by loosening Jackson and Servedio's restriction which bounds the function away from 0 and 1 by a constant; we restrict our attention to functions that are bounded away from one by a slowly growing function of $n$, with no restriction on how close the function is to zero. Even in the case of $c \le 2$, for large $n$, the set of functions they learn is a subset of the functions we learn. They allow the number of terms $m$ to satisfy $\epsilon 2^k \le m \le 2^{k+1}\ln 2$ for a constant $\epsilon$ ($0 < \epsilon < 0.09$). Instead, we restrict the number of terms to $m \le 2^{k+1}c\log\log(n)$.
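Definitions 2 and 3 can be checked by brute force on toy instances. The following sketch enumerates $\Delta_s$ directly; the helper names and the toy formula $x_0x_1 \vee x_2x_3$ are ours, not the paper's.

```python
from itertools import product

def is_s_minimal(f_eval, e, s):
    """e is s-minimal for f iff f(e) is true and setting any single
    x in s to 0 makes f false (Definition 2)."""
    if not f_eval(e):
        return False
    for x in s:
        e0 = list(e)
        e0[x] = 0                      # the transformation e^{x <- 0}
        if f_eval(e0):
            return False
    return True

def delta_s(f_eval, n, s):
    """Delta_s: the set of s-minimal positive examples (Definition 3)."""
    return [e for e in product([0, 1], repeat=n)
            if is_s_minimal(f_eval, e, s)]

# Toy DNF f = x0 x1 OR x2 x3 over n = 4 variables.
f = [{0, 1}, {2, 3}]
f_eval = lambda e: any(all(e[i] for i in t) for t in f)
D = delta_s(f_eval, 4, {0, 1})
```

Here every example in $\Delta_{\{0,1\}}$ has $x_0 = x_1 = 1$ and is satisfied only by the term containing $\{0,1\}$, illustrating the remark after Definition 3.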
As Jackson and Servedio in [7], we learn in the uniform distribution model, where each example is chosen uniformly at random and labeled according to the unknown function. Our goal is stronger than theirs, in that we exactly learn with probability $1-\delta$ (they learn a function which is $\epsilon$-close with probability $1-\delta$). We run in time polynomial in the inverse of the probability that an example satisfies a term (i.e., time polynomial in $2^k$). The model for our class of random monotone DNF formulas is as follows: let $F^{n,k,m}$ be the set of monotone DNF

Function Distinguishing Subsets
· $S = \{s\subseteq X \mid |s| = c+2$ and $I_s > 2^n\cdot\frac{1}{n^{c+1/5}}\}$
· For $i = (c+3)$ to $\varphi(n)$:
  - $S' = \emptyset$
  - For $s\in S$ and $x\in X$:
      If $I_{s\cup\{x\}} > 2^n\cdot\frac{1}{n^{c+1/5}}$ then add $(s\cup\{x\})$ to $S'$
  - $S = S'$
· Return $S$

Figure 1: Function Distinguishing Subsets

Algorithm Learn Random Monotone DNF
· $S$ = Distinguishing Subsets
· $f' = \emptyset$
· For $s\in S$:
  - $t = s$
  - For $x\in X$:
      If $I_{s\cup\{x\}} > 2^n\cdot\frac{1}{n^{c+1/5}}$ then add $x$ to $t$
  - add $t$ to $f'$
· Return $f'$

From this knowledge we build the rest of the term. Our goal is to exactly learn $f$ with probability $1-\delta$. Unfortunately, we do not know how to compute the size of $\Delta_s$. Instead we estimate $|\Delta_s|$.

Definition 4. Let $I_s = \sum_{a_s}(-1)^{\#_0(a_s)}\,|E^+_{a_s}|$.

Our paper focuses on proving that most monotone $k$-DNF are well behaved, that for well-behaved functions $I_s$ approximates $|\Delta_s|$ for $c+2 \le |s| \le \varphi(n)+1$, and that if $s\subseteq t\in f$ then $\Delta_s$ is sufficiently large. The organization of our paper is as follows. In Section 3 we present a simple algorithm that exploits the knowledge that we can find a subset of a term. In Subsection 4.1 we partition the set of positive examples and use this partition to define how $I_s$ miscounts the size of $\Delta_s$. In Subsection 4.2 we prove most $f\in F^{n,k,m}$ are well behaved. In Subsection 4.3 we bound by how much $I_s$ miscounts $|\Delta_s|$ for well-behaved functions. In Subsection 4.4 we prove that for well-behaved $f\in_R F^{n,k,m}$, if $s\subseteq t\in f$ then $\Delta_s$ is sufficiently large, enabling us to discover whether $s\subseteq t$. We put some of the technical details into the appendix.
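On instances small enough to enumerate, the statistic of Definition 4 can be computed exactly and compared against $|\Delta_s|$, and the sampling estimate used later in the paper can be simulated. A sketch; the helper names and the toy formula are ours, not the paper's.

```python
import random
from itertools import product

def I_s_exact(f_eval, n, s):
    """I_s = sum over partial assignments a_s of (-1)^{#0(a_s)} |E^+_{a_s}|
    (Definition 4), computed here by brute-force enumeration."""
    s = sorted(s)
    total = 0
    for bits in product([0, 1], repeat=len(s)):   # a_s ranges over {0,1}^s
        sign = (-1) ** bits.count(0)              # (-1)^{#0(a_s)}
        count = sum(
            1
            for e in product([0, 1], repeat=n)
            if all(e[x] == b for x, b in zip(s, bits)) and f_eval(e)
        )                                         # count = |E^+_{a_s}|
        total += sign * count
    return total

def I_s_sampled(f_eval, n, s, samples, rng):
    """Monte Carlo version: each uniform positive example e contributes
    (-1)^{#0 of e restricted to s}; rescale by 2^n / samples."""
    acc = 0
    for _ in range(samples):
        e = tuple(rng.randint(0, 1) for _ in range(n))
        if f_eval(e):
            acc += (-1) ** sum(1 for x in s if e[x] == 0)
    return acc * (2 ** n) / samples

# Toy DNF f = x0 x1 OR x2 x3 over n = 4 variables.
f = [{0, 1}, {2, 3}]
f_eval = lambda e: any(all(e[i] for i in t) for t in f)
```

For $s = \{0,1\}$, a subset of a term, $I_s = |\Delta_s| = 3$ on this toy instance; for $s = \{0,2\}$, contained in no term, $|I_s|$ is small. The sampled version concentrates around the exact value, which is what lets the algorithm threshold on $I_s$.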
In Appendix 1A, we provide some observations and simplifications of algebraic expressions used in our proofs. In Appendix 1B, we prove that most $f\in_R F^{n,k,m}$ are well behaved. In Appendix 1C, we bound $\Delta_t$ for well-behaved DNF. In Appendix 2A, we use standard sampling techniques to prove we can approximate $|E^+_{a_s}|$ sufficiently well. In Appendix 2B, for the sake of completeness, we provide the details that show our algorithm finds the unknown monotone DNF in polynomial time with high probability.

Figure 2: Algorithm Learn Random Monotone DNF

4 Approximating $\Delta_s$ by $I_s$

In this section, we show that with high probability $I_s$ approximates $|\Delta_s|$ for $c+2 \le |s| \le \varphi(n)+1$ to within $2^n\cdot\frac{4k\log^6(n)\,n^{2/3}}{n^{c+1}}$ (i.e., $|I_s - |\Delta_s|| < 2^n\cdot\frac{4k\log^6(n)\,n^{2/3}}{n^{c+1}}$). We use Subsections 4.1 through 4.3 to prove this main theorem. In Subsection 4.4, we prove that $|\Delta_s| \ge 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c}$ if and only if $s\subseteq t\in f$.

4.1 Observations about $I_s$

To explore how $I_s$ relates to the size of $\Delta_s$, we partition the set $E^+$ of positive examples, grouping examples that "map" under $s$ to the same example in $E^+_{1_s}$. Observing the behavior of a partition during the calculation of $I_s$, we bound how $I_s$ misjudges the size of $\Delta_s$. (We bound the size of the miscalculation in Subsection 4.3.)

Definition 5. For $s\subseteq X$ and $e\in E^+_{1_s}$, we define a set of partial assignments $A_{e,s} = \{a_s \mid e^{s\leftarrow a_s}\in E^+\}$, which map $e$ to another positive example under $s$.

Next, using this partition of the set of positive examples, we define a criterion for $e\in E^+_{1_s}$ to be correctly counted.

Definition 6. For $e\in E^+_{1_s}$ we define $I_e = \sum_{a_s\in A_{e,s}}(-1)^{\#_0(a_s)}$.

Observation 7. $I_s = \sum_{e\in E^+_{1_s}} I_e$.

3 The Algorithm for Finding $f$ Using $I_s$

Using $I_s$ as our estimate for $|\Delta_s|$, our algorithm builds terms in three stages. First, our algorithm tests all subsets of size $c+2$, selecting those that are a subset of a term in $f$. Next, it builds upon these subsets, variable by variable, till it has found all subsets of terms of $f$ of size $\varphi(n)$.
Finally, having a subset unique to a single term in $f$ (we prove this uniqueness for subsets of size $\varphi(n)$ later in the paper, in Corollary 46), we find the rest of the variables of this term. The steps of the first two stages are in Figure 1, and the steps of the third stage are in Figure 2.

Definition 8. An $e\in E^+_{1_s}$ is correctly counted iff: if $e\in E^+_{1_s}\setminus\Delta_s$ then $I_e = 0$, and if $e\in\Delta_s$ then $I_e = 1$.

Note, if all examples in $E^+_{1_s}$ are counted correctly then $I_s = |\Delta_s|$. We observe that every $e\in\Delta_s$ is correctly counted.

Lemma 9. For all $e\in\Delta_s$, $I_e = 1$.

Proof: $A_{e,s} = \{1_s\}$.

By characterizing examples which are correctly counted, we restrict the number of examples that could be incorrectly counted. We describe two ways examples are correctly counted.

Lemma 10. An example $e\in E^+_{1_s}$ is correctly counted if $\exists x\in s$ such that $\forall a_s\in A_{e,s}$, $(a_s)^{flip(x)}\in A_{e,s}$.

Proof: Let $x$ be such that $\forall a_s\in A_{e,s}$, $(a_s)^{flip(x)}\in A_{e,s}$. Then both $(-1)^{\#_0(a_s)}$ and $(-1)^{\#_0((a_s)^{flip(x)})}$ are included in the sum, where the parities of $\#_0(a_s)$ and $\#_0((a_s)^{flip(x)})$ are opposite. Thus $I_e = 0$ and $e$ is counted correctly.

Corollary 11. An example $e\in E^+_{1_s}$ is correctly counted if $\exists t\in f$ such that $t(e)$ and $t\cap s = \emptyset$.

Proof: The corollary follows from Lemma 10 and the definition of $I_e$, since in this case every partial assignment is contained in $A_{e,s}$.

If $e$ is not known to be correctly counted by Lemma 10 and Corollary 11, it may or may not be correctly counted, but our proof will not need to consider this option.

4.2 Properties of Well-Behaved Functions

In this subsection, we describe the properties a function needs for our proof to hold; our algorithm works for functions whose terms are not "clustered" together. We prove that with high probability these properties hold for $f\in_R F^{n,k,m}$. We call DNF formulas that have these properties "well behaved."

Definition 19. A monotone DNF function $f\in F^{n,k,m}$ is well behaved iff for all $s\subseteq t\in f$ where $|s| \le \varphi(n)+1$, and all $a_s$ where $z = \#_1(a_s)$:
· Small $z$ property: if $0 < z \le c$ then $|T_{a_s}| < 3m_{\max}k^z/n^z$,
· Medium $z$ property: if $c < z < \varphi(n)$ then $|T_{a_s}| < \varphi(n)$, and
· Large $z$ property: if $z \ge \varphi(n)$ then $|T_{a_s}| \le 1$.

Using Chernoff bounds, we prove random monotone DNF are well behaved with high probability.

Theorem 20. For a fixed $c$ and sufficiently large $n$, if $f\in_R F^{n,k,m}$ for $m \le 2^{k+1}c\log\log(n)$, then $f$ is well behaved with probability at least $1 - n^{2c}\log(n)\left(\frac1n\right)^{\varphi(n)-1}$.

Proof: This follows from Corollaries 42, 44, and 46 (found in the appendix), noting that the small, medium, and large $z$ properties of being well behaved each fail to be satisfied with probability at most $\frac13 n^{2c}\log(n)\left(\frac1n\right)^{\varphi(n)-1}$. Consequently, $f\in_R F^{n,k,m}$ is well behaved with probability at least $1 - n^{2c}\log(n)\left(\frac1n\right)^{\varphi(n)-1}$.

4.3 Observations about Well-Behaved Monotone DNF Formulas

Definition 12. Let $s\subseteq X$; we define the set of miscounted examples by $M_s = \{e\in E^+_{1_s}\setminus\Delta_s \mid I_e\ne 0\}$.

Partitioning $M_s$ based on sets of partial assignments, we simplify bounding the number of miscounted examples.

Definition 13. Let $s\subseteq X$ and $A\subseteq\{0,1\}^{|s|}$; we define $M_{s,A} = \{e\in M_s \mid A_{e,s} = A\}$.

We define a partial order by: $a_s \prec a'_s$ iff $\forall x\in s$, $x(a_s) \le x(a'_s)$, and $a_s\ne a'_s$. The smallest partial assignments are very important to our proof; they determine whether an example $e\in E^+_{1_s}$ is miscounted.

Definition 14. Let $A\subseteq\{0,1\}^{|s|}$; we define $L(A) = \{a_s\in A \mid \nexists\, a'_s\in A,\ a'_s\prec a_s\}$.

Lemma 15. Let $e\in M_{s,A}$. Then $\forall a_s\in L(A)$, $\exists t\in T_{a_s}$ where $t(e)$.

Proof: By Definition 5, given any $e\in M_{s,A}$ and $a_s\in L(A)$, there is an $e'\in E^+_{a_s}$ which maps under $s$ to $e$ (i.e., $e = e'^{\,s\leftarrow 1_s}$), which implies $e' = e^{s\leftarrow a_s}$. Because $e'$ is $X_{a_s}$-minimal, we know $\exists t\in T_{a_s}$ such that $t(e')$, which implies $t(e)$ since $f$ is monotone.

We have now proved in Lemmas 10 and 15 that every miscounted example is satisfied by a set of terms whose union contains $s$. We will use this fact in Lemma 24, where we bound the number of miscounted examples in $M_{s,A}$.
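The three properties of Definition 19 are all statements about the sets $T_{a_s}$, and on explicit term lists they are straightforward to test; the large $z$ property, for instance, reduces to a bound on pairwise term overlap. A small sketch; the function names are ours, and the toy formula stands in for a random draw from $F^{n,k,m}$.

```python
from itertools import combinations

def T_as(f, s, ones):
    """T_{a_s}: the terms t of f whose intersection with s equals the
    1-set X_{a_s} of the partial assignment a_s."""
    return [t for t in f if t & s == ones]

def max_pairwise_overlap(f):
    """The large z property holds iff no two distinct terms share
    phi(n) or more variables, i.e. every large-weight T_{a_s} has
    at most one element; this returns the largest pairwise overlap."""
    return max((len(t & u) for t, u in combinations(f, 2)), default=0)

# Three terms over 8 variables; terms 0 and 1 share variable 2.
f = [frozenset({0, 1, 2}), frozenset({2, 3, 4}), frozenset({5, 6, 7})]
```

Here the largest pairwise overlap is 1, so any stand-in threshold for $\varphi(n)$ of 2 or more would classify this toy formula as satisfying the large $z$ property.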
Knowing that $A$ is a subset of the partial assignments to $s$, we calculate by how much an example can be miscounted.

Observation 16. Let $e\in M_{s,A}$; then $|I_e| < |A| \le 2^{|s|}$.

Definition 17. Let $\mathcal A_s = \{A \mid A = A_{e,s}$ for an $e\in M_s\}$.

Observation 18. $I_s - |\Delta_s| = \sum_{e\in M_s}I_e = \sum_{A\in\mathcal A_s}\sum_{e\in M_{s,A}}I_e$.

In this subsection, we derive some properties of well-behaved functions. First, we bound the number of variables that occur in more than one term from a set of terms $T\subseteq f$ for $f\in_R F^{n,k,m}$. Next, we bound the probability that an example satisfies every term in $T$. Third, we bound the size of $M_{s,A}$, using the probability that an example satisfies a term in $T_{a_s}$ for every $a_s\in L(A)$. At the end of this subsection we bound $|M_s|$ and $I_s - |\Delta_s|$.

Corollary 21. Let $f$ be a well-behaved monotone $k$-DNF formula and $T\subseteq f$. Then $|\{x \mid x\in(t\cap t')$ for some distinct $t,t'\in T\}| < |T|^2\varphi(n)$.

Proof: For $f$ a well-behaved monotone $k$-DNF, we know that a pair of terms $t,t'\in f$ have at most $\varphi(n)$ variables in common. Since the number of pairs is $\binom{|T|}{2}$, we bound the total number of variables used by more than one term by $\binom{|T|}{2}\varphi(n)$. Note that what we have proved is stronger than what we have claimed; the form of our claim is for subsequent technical convenience.

Knowing an upper bound on the number of variables occurring in more than one term of a set of terms, we bound the probability that an example satisfies every term in this set.

Lemma 22. Let $f$ be a well-behaved monotone $k$-DNF, and let $T\subseteq f$ be a subset of terms. Then $|\{e\in E \mid \forall t\in T,\ t(e)\}| \le 2^n\cdot\frac{1}{2^{|T|k-|T|^2\varphi(n)}}$.

Proof: The terms of $T$ share at most $|T|^2\varphi(n)$ of their at most $|T|k$ variables, by Corollary 21. Thus the number of variables that need to be satisfied is at least $|T|k-|T|^2\varphi(n)$.
We note that if we restrict our examples to have the bits in $s$ set to one, we get the following corollary.

Corollary 23. For $s\subseteq X$, if $|\{e\in E \mid \forall t\in T,\ t(e)\}| \le 2^n\cdot\frac{1}{2^{|T|k-|T|^2\varphi(n)}}$, then $|\{e\in E_{1_s} \mid \forall t\in T,\ t(e)\}| \le 2^n\cdot\frac{1}{2^{|T|k-|T|^2\varphi(n)}}$.

Proof: The size of the set $E_{1_s}$ is $2^{n-|s|}$. Given that $|\{e\in E \mid \forall t\in T,\ t(e)\}| \le 2^n\cdot 2^{-(|T|k-|T|^2\varphi(n))}$, the restriction to examples from $E_{1_s}$ reduces the number of variables that must be satisfied to at least $|T|k-|T|^2\varphi(n)-|s|$ (i.e., at most $|s|$ bits were forced to one). Thus
$|\{e\in E_{1_s} \mid \forall t\in T,\ t(e)\}| \le 2^{n-|s|}\cdot\frac{1}{2^{|T|k-|T|^2\varphi(n)-|s|}} = 2^n\cdot\frac{1}{2^{|T|k-|T|^2\varphi(n)}}$.

We now bound the number of examples in $M_{s,A}$.

Lemma 24. For fixed $c$ and sufficiently large $n$, let $f$ be a well-behaved monotone $k$-DNF, let $s\subseteq X$ where $c+2\le|s|\le\varphi(n)+1$, and let $A\in\mathcal A_s$. Then $|M_{s,A}| < 2^n\cdot\frac{k\log^5(n)}{n^{c+1}}$.

Proof: Let $v = |L(A)|$. As noted in Lemma 15, every $e\in M_{s,A}$ is satisfied by at least one term from $T_{a_s}$ for every $a_s\in L(A)$. From Corollary 23, we know that the probability that an example in $E_{1_s}$ satisfies a fixed set of $v$ terms is at most $\frac{1}{2^{vk-v^2\varphi(n)}}$.¹ Therefore we bound $|M_{s,A}|$ by bounding the number of $e\in M_{s,A}$ which are satisfied by at least one term from every $T_{a_s}$ for every $a_s\in L(A)$; we create this bound by using a Bonferroni-type argument:
$|M_{s,A}| = |\{e\in M_s \mid \forall a_s\in L(A),\ \exists t\in T_{a_s},\ t(e)\}|$ (Def. 13)
$\le 2^n\cdot\frac{1}{2^{vk-v^2\varphi(n)}}\prod_{a_s\in L(A)}|T_{a_s}|$ (Lemma 15).

In counting the number of possible ways an example $e\in M_{s,A}$ could be satisfied by one term from every $T_{a_s}$, we consider two cases. In the first case, we assume that $\#_1(a_s)\le c$ for all $a_s\in L(A)$. By Lemma 10 and Corollary 11, $s\subseteq\bigcup_{t\in T_e}t$ and $\#_1(a_s)\ge 1$ for all $a_s\in A_{e,s}$; let $w = \sum_{a_s\in L(A)}\#_1(a_s) \ge \max\{|L(A)|,|s|\}$. Using the assumption that $f$ is well behaved, we know $|T_{a_s}| < 3m_{\max}\frac{k^{\#_1(a_s)}}{n^{\#_1(a_s)}}$, and this implies that
$|M_{s,A}| \le 2^n\cdot\frac{(3m_{\max})^v k^w}{2^{vk-v^2\varphi(n)}\,n^w} = 2^n\cdot\frac{2^{v^2\varphi(n)}\left(6c\log\log(n)\right)^v k^w}{n^w}$ (since $n^{cv} = 2^{kv}$)
$\le 2^n\cdot\frac{n^{1/3}\left(6c\log\log(n)\right)^{c+2}k^{c+2}}{n^{c+2}}$ (from Obs. 35, $w\ge v$, and $w\ge|s|\ge c+2$)
$\le 2^n\cdot\frac{1}{n^{c+1}}$ (from Observation 33).

In the second case, there exists an $a_s\in L(A)$ such that $\#_1(a_s)>c$; since $f$ is well behaved, $|T_{a_s}|<\varphi(n)$ for such an $a_s$. Let $v' = |\{a_s\in L(A) \mid \#_1(a_s)>c\}|$. If $a_s\in L(A)$ has $\#_1(a_s)\le c$, then since $f$ is well behaved, $|T_{a_s}| < 3m_{\max}\frac{k^{\#_1(a_s)}}{n^{\#_1(a_s)}}$. Using these bounds, again noting that $e\in M_{s,A}$ is satisfied by one term from each $T_{a_s}$, and using $\#_1(a_s)\ge 1$ and $\frac{k^{\#_1(a_s)}}{n^{\#_1(a_s)}}\le\frac{k}{n}$, we reduce the formula so that
$|M_{s,A}| \le 2^n\cdot\frac{(\varphi(n))^{v'}}{2^{vk-v^2\varphi(n)}}(3m_{\max})^{v-v'}\frac{k^{v-v'}}{n^{v-v'}} = 2^n\cdot\frac{2^{v^2\varphi(n)}(\varphi(n))^{v'}\left(6c\log\log(n)\right)^{v-v'}k^{v-v'}}{2^{v'k}\,n^{v-v'}}$ (since $n^{c(v-v')} = 2^{k(v-v')}$).

We now break the calculation down into two sub-cases. If $v=2$, then the expression is largest if $v'=1$; in this case we bound $|M_{s,A}|$ by
$2^n\cdot\frac{\varphi(n)\,2^{4\varphi(n)}\left(6c\log\log(n)\right)k}{2^k\,n} < 2^n\cdot\frac{k\log^5(n)}{n^{c+1}}$.
If $v\ge 3$, we note¹ that the expression is again largest if $v'=1$, and using Observation 35 we reduce the formula to
$2^n\cdot\frac{n^{1/3}\varphi(n)\left(6c\log\log(n)\right)^{v-1}k^{v-1}}{2^k\,n^{v-1}} \le 2^n\cdot\frac{k}{2^k\,n^2}$.
Therefore $|M_{s,A}| \le 2^n\cdot\frac{k\log^5(n)}{n^{c+1}}$.

Having computed an upper bound on the number of miscounted examples in $M_{s,A}$, we now bound $|M_s|$.

¹ The argument here passes over a minor potential difficulty: if $v$ is large, Corollary 23 does not come into play, but the crucial fact is nevertheless true, as we show in Observation 32.

Corollary 25. Let $f$ be a well-behaved monotone $k$-DNF, and let $s\subseteq X$ where $c+2\le|s|\le\varphi(n)+1$. Then $|M_s| < 2^n\cdot\frac{4k\log^5(n)\,n^{2/3}}{n^{c+1}}$.

Proof: This follows from $|M_s| = \sum_{A\in\mathcal A_s}|M_{s,A}| < |\mathcal A_s|\cdot 2^n\cdot\frac{k\log^5(n)}{n^{c+1}}$. We note that $|\mathcal A_s|$ is bounded by the number of subsets of the set of subsets of $s$, i.e.,
$2^{2^{|s|}} \le 2^{2^{\log\log(n^{1/3})+1}} = 2^{2\log(n^{1/3})} \le 4n^{2/3}$.
Thus $|M_s| < 2^n\cdot\frac{4k\log^5(n)\,n^{2/3}}{n^{c+1}}$.

Knowing $|M_s|$, we now compute the difference between $I_s$ and $|\Delta_s|$.
This bound is computed by multiplying $|M_s|$ by a bound on how large the miscount can be for a single example.

Theorem 26. Let $f$ be a well-behaved monotone $k$-DNF formula, and let $s\subseteq X$ where $c+2\le|s|\le\varphi(n)+1$. Then $|I_s - |\Delta_s|| < 2^n\cdot\frac{4k\log^6(n)\,n^{2/3}}{n^{c+1}}$.

Proof: As noted earlier, $I_s - |\Delta_s| = \sum_{e\in M_s}I_e$. Using Corollary 25, we know that $|M_s| < 2^n\cdot\frac{4k\log^5(n)\,n^{2/3}}{n^{c+1}}$. From Observation 16, we know that for all $e$, $|I_e| < 2^{|s|} \le 2^{\varphi(n)+1} = 2\log(n^{1/3})$. Consequently,
$|I_s - |\Delta_s|| \le |M_s|\cdot 2\log(n^{1/3}) < 2^n\cdot\frac{4k\log^6(n)\,n^{2/3}}{n^{c+1}}$.

The previous theorem shows there exists a large gap that reliably determines whether $s\subseteq t\in f$ for $c+2\le|s|\le\varphi(n)+1$ by computing $I_s$. This means that given a small set $s\subseteq t\in f$ and $x\in X\setminus s$, we can determine whether or not $s\cup\{x\}$ is a subset of a term of $f$, and this is the key to our algorithm. In this section, we proved we could determine whether a set $s$ is a subset of a term, for $c+2\le|s|\le\varphi(n)+1$, by computing $I_s$. Unfortunately, we cannot compute $I_s$ exactly in polynomial time, since we cannot compute $|E^+_{a_s}|$ in polynomial time. Instead we approximate $I_s$ using standard sampling techniques: we estimate this value by sampling $g_s = n^{2c+3}\,2^{k+|s|}$ uniformly chosen labeled examples from $E$. Thus we can effectively estimate $I_s$ with high probability; details can be seen in Appendix 2, Subsection A. Our fairly straightforward algorithm is easily adapted to use the sampled values of $I_s$, and thus runs in time polynomial in $n$ and $2^k$; details can be found in Appendix 2, Subsection B.

5 Future Work

Extensions of the ideas presented here can also handle the non-monotone case. We are currently writing up this case and checking the proofs. We are also working on relaxing the requirement that $k$ is fixed.

Acknowledgements

I would like to thank Stuart Kurtz for many conversations and help with the presentation, and Carsten Lund for help with polishing the paper.

4.4 Bounding $|\Delta_s|$

References

[1] H. Aizenstein and L. Pitt. On the learnability of disjunctive normal form formulas. Machine Learning, 19:183–208, 1995.
[2] D.
Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
[3] J. Canny. Lecture 10, CS 174. www.cs.berkeley.edu/~jfc/cs174/lecs/lec10/lec10.pdf.
[4] J. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414–440, 1997.
[5] J. Jackson, H. Lee, R. Servedio, and A. Wan. Learning random monotone DNF. Electronic Colloquium on Computational Complexity, Report No. 129, 2007.
[6] J. Jackson and R. Servedio. Learning random log-depth decision trees under the uniform distribution. SIAM Journal on Computing, 34(5), 2005.
[7] J. Jackson and R. Servedio. On learning random DNF formulas under the uniform distribution. Theory of Computing, 2(8):147–172, 2006.
[8] E. Mossel, R. O'Donnell, and R. Servedio. Learning juntas. In Proceedings of STOC, pages 206–212, 2003.
[9] R. Servedio. On learning monotone DNF under product distributions. Information and Computation, 193(1):57–74, 2004.
[10] S. Smale. On the average number of steps of the simplex method of linear programming. Mathematical Programming, 27:241–267, 1983.

Definition 27. Let $E_{f\setminus t} = \{e\in E \mid \exists t'\in f\setminus t,\ t'(e)\}$.

Next we prove that every term has a high probability of being uniquely satisfied. Jackson and Servedio have a similar lemma, their Lemma 3.6.

Lemma 28. Let $f\in F^{n,k,m}$ be a well-behaved monotone $k$-DNF function and let $t\in f$. Then $|E^+_{1_t}\setminus E_{f\setminus\{t\}}| \ge 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c}$.

The proof of this lemma is found in Appendix 1, Subsection C. We note that if $f$ is a monotone DNF and $e\in(E^+_{1_t}\setminus E_{f\setminus\{t\}})$, then $e\in\Delta_t$.

Corollary 29. Let $f\in F^{n,k,m}$ be a well-behaved monotone $k$-DNF, and let $s\subseteq t\in f$. Then $|\Delta_s| > 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c}$.

The following theorem is crucial; it is the key computation we use in our algorithm Learn Random Monotone DNF.

Theorem 30. Let $f\in F^{n,k,m}$ be well behaved, and let $c+2\le|s|\le\varphi(n)+1$:
· if $s\subseteq t\in f$ then $I_s \ge 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c} - 2^n\cdot\frac{4k\log^6(n)\,n^{2/3}}{n^{c+1}}$,
· if $s\not\subseteq t$ for all $t\in f$ then $I_s \le 2^n\cdot\frac{4k\log^6(n)\,n^{2/3}}{n^{c+1}}$.

[11] Valiant, L.G. (1984). A theory of the learnable.
Communications of the ACM, 27(11):1134–1142.
[12] Valiant, L.G. Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, Vol. 1, pages 560–566, 1985.
[13] K. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial time. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 314–326, 1990.

APPENDIX 1

A Useful Observations

We use the following observations and simplifications of algebraic expressions in our proofs.

Lemma 36. For a positive integer $a \le \varphi(n)$ and $c \le \log(n)/(3\log\log(n))$,
$\binom{m}{a}\left(\frac{k^{c+1}}{n^{c+1}}\right)^a > \binom{m}{a+1}\left(\frac{k^{c+1}}{n^{c+1}}\right)^{a+1}$.

Proof: The proof follows from expanding the formulas:
$\binom{m}{a}\left(\frac{k^{c+1}}{n^{c+1}}\right)^a > \binom{m}{a+1}\left(\frac{k^{c+1}}{n^{c+1}}\right)^{a+1}
\iff 1 > \frac{m^{\underline{a+1}}\,a!}{m^{\underline{a}}\,(a+1)!}\cdot\frac{k^{c+1}}{n^{c+1}}
\iff 1 > \frac{m-a}{a+1}\cdot\frac{k^{c+1}}{n^{c+1}}$,
which holds for sufficiently large $n$.

Observation 37. For a positive integer $k$, $\left(1-\frac{1}{2^k}\right)^{2^k} \ge \frac14$.

Proof: This follows from $1-x \ge 4^{-x}$ for $0\le x\le\frac12$, applied with $x = 2^{-k}$.

Observation 31. For $f$ a well-behaved monotone $k$-DNF and $T\subseteq f$ where $|T| \ge 2\log\log(n)$, we have $|\{e\in E_{1_s} \mid \forall t\in T,\ t(e)\}| < 2^n\cdot\frac{1}{n^{\log\log(n)}}$ for sufficiently large $n$.

Proof: First, we observe that if $T'\subseteq T$ then $\{e\in E_{1_s} \mid \forall t\in T,\ t(e)\} \subseteq \{e\in E_{1_s} \mid \forall t\in T',\ t(e)\}$. Therefore, using Corollary 23, given any $T'\subseteq T$ where $|T'| = 2\log\log(n)$,
$|\{e\in E_{1_s} \mid \forall t\in T,\ t(e)\}| \le 2^n\cdot\frac{1}{2^{2\log\log(n)k-(2\log\log(n))^2\varphi(n)}} < 2^n\cdot\frac{1}{n^{\log\log(n)}}$.

Observation 32. For a well-behaved monotone $k$-DNF, given $A\subseteq\{0,1\}^{|s|}$ where $|\{a_s\in A \mid \#_1(a_s) > c\}| \ge 2\log\log(n)$, we have $|M_{s,A}| < 2^n\cdot\frac{(\varphi(n))^{2\log\log(n)}}{n^{\log\log(n)}}$.

Proof: Let $A' = \{a_s\in A \mid \#_1(a_s) > c\}$. Using Observation 31, if $|A'| \ge 2\log\log(n)$ then
$|M_{s,A}| \le |\{e\in E_{1_s} \mid \forall a_s\in A',\ \exists t\in T_{a_s}$ such that $t(e)\}| \le 2^n\cdot\frac{1}{n^{\log\log(n)}}\prod_{a_s\in A'}|T_{a_s}| \le 2^n\cdot\frac{(\varphi(n))^{2\log\log(n)}}{n^{\log\log(n)}}$.
The last inequality follows from noting that for $\#_1(a_s) > c$ we have $|T_{a_s}| \le \varphi(n)$ by the medium and large $z$ properties, and from noting that the product is maximized for $|A'| = 2\log\log(n)$.

Observation 33.
For $c$ a constant, $n^{\underline{c+1}} = n(n-1)\cdots(n-c) > n^{c+1} - \frac{c(c+1)}{2}n^c$, and $n^{c+1} < n^{\underline{c+1}} + \frac{c(c+1)}{2}n^c + \left(\frac{c(c+1)}{2}\right)^2 n^{c-1}$.

Observation 34. Let $s\subseteq X$ and $e\in E^+$. If $\#_1(a_s)\le c$ for all $a_s\in L(A_{e,s})$, then $|L(A_{e,s})| < |s|^{\frac{(c+1)c}{2}}$.

Proof: This follows from observing that $\binom{|s|}{c} \le |s|^c$.

Observation 35. Let $s\subseteq X$ where $|s|\le\varphi(n)+1$, and let $e\in E^+$. If $\#_1(a_s)\le c$ for all $a_s\in L(A_{e,s})$, then $2^{|L(A_{e,s})|^2\varphi(n)} < n^{1/3}$ for large enough $n$.

Proof: Let $s\subseteq X$ where $|s|\le\varphi(n)+1$, and let $A_{e,s}$ be such that $\#_1(a_s)\le c$ for all $a_s\in L(A_{e,s})$. Then using Observation 34 we know $|L(A_{e,s})| < |s|^{\frac{(c+1)c}{2}}$, and so
$2^{|L(A_{e,s})|^2\varphi(n)} \le 2^{(\varphi(n)+1)^{(c+1)c}\varphi(n)} = \left(\log(n^{1/3})\right)^{(\log\log(n^{1/3})+1)^{(c+1)c}} < n^{1/3}$
for large enough $n$.

Observation 38. For constant $c$ and sufficiently large $n$,
$\log(n)\binom{m}{\varphi(n)}\left(\frac{k^{c+1}}{n^{c+1}}\right)^{\varphi(n)} + m\binom{m}{\log(n)}\left(\frac{k^{c+1}}{n^{c+1}}\right)^{\log(n)} < \left(\frac1n\right)^{\varphi(n)-1}$.

Proof: The proof follows from the following calculation:
$\log(n)\binom{m}{\varphi(n)}\left(\frac{k^{c+1}}{n^{c+1}}\right)^{\varphi(n)} + m\binom{m}{\log(n)}\left(\frac{k^{c+1}}{n^{c+1}}\right)^{\log(n)}$
$< \log(n)\left(2c\log\log(n)\right)^{\varphi(n)}\frac{n^{c\varphi(n)}\,k^{(c+1)\varphi(n)}}{\left(n^{c+1}-\frac{(c+1)(c+2)}{2}n^c\right)^{\varphi(n)}} + \left(2c\log\log(n)\right)^{\log(n)+1}\frac{n^{c(\log(n)+1)}\,k^{(c+1)\log(n)}}{\left(n^{c+1}-\frac{(c+1)(c+2)}{2}n^c\right)^{\log(n)}}$
$= \log(n)\left(2c\log\log(n)\right)^{\varphi(n)}\frac{k^{(c+1)\varphi(n)}}{\left(n-\frac{(c+1)(c+2)}{2}\right)^{\varphi(n)}} + \left(2c\log\log(n)\right)^{\log(n)+1}\frac{n^c\,k^{(c+1)\log(n)}}{\left(n-\frac{(c+1)(c+2)}{2}\right)^{\log(n)}}$
$< \left(\frac1n\right)^{\varphi(n)-1}$.
The first inequality follows from Observation 33 and substitution using $m \le 2c\log\log(n)n^c$ and $\binom{m}{i} \le \frac{m^i}{i!}$. The second step can be seen by multiplying the first summand by $\frac{1/n^{c\varphi(n)}}{1/n^{c\varphi(n)}}$ and the second summand by $\frac{1/n^{c\log(n)}}{1/n^{c\log(n)}}$. The last inequality can be seen from
$\left(n-\frac{(c+1)(c+2)}{2}\right)^{\varphi(n)} > n^{\varphi(n)} - \frac{\varphi(n)(c+1)(c+2)}{2}\,n^{\varphi(n)-1}$ and $\left(n-\frac{(c+1)(c+2)}{2}\right)^{\log(n)} > n^{\log(n)} - \frac{\log(n)(c+1)(c+2)}{2}\,n^{\log(n)-1}$.

Lemma 40 (Chernoff; the simplified upper bound from Canny's lecture notes [3]). Let $\epsilon < 2e-1$, and let $X$ be a sum of independent Poisson trials with expected value $\mu$. Then $\Pr\{X > (1+\epsilon)\mu\} < e^{-\mu\epsilon^2/4}$.
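The simplified Chernoff bound of Lemma 40 can be sanity-checked by simulation; a quick sketch with toy parameters (the helper name is ours):

```python
import random
import math

def tail_probability(trials, p, epsilon, runs, rng):
    """Empirically estimate Pr{X > (1 + epsilon) * mu} for X a sum of
    independent Bernoulli(p) trials, where mu = trials * p."""
    mu = trials * p
    threshold = (1 + epsilon) * mu
    hits = 0
    for _ in range(runs):
        x = sum(1 for _ in range(trials) if rng.random() < p)
        if x > threshold:
            hits += 1
    return hits / runs

rng = random.Random(2)
emp = tail_probability(trials=500, p=0.1, epsilon=1.0, runs=2000, rng=rng)
bound = math.exp(-500 * 0.1 * 1.0 ** 2 / 4)   # e^{-mu * eps^2 / 4}
```

With $\mu = 50$ and $\epsilon = 1$, the Chernoff bound $e^{-\mu\epsilon^2/4} = e^{-12.5}$ is already tiny, and the empirical tail frequency should sit at or below it.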
We now bound the number of terms in $T_{a_s}$ for $\#_1(a_s) \le c$, with high probability.

Lemma 41. Fix $s\subseteq X$ and $a_s$ where $z = \#_1(a_s) \le c$, and $k \le \log^2(n)$. Then
$\Pr_{f\in_R F^{n,k,m}}\left\{|T_{a_s}| > 3m_{\max}\frac{k^z}{n^z}\right\} < e^{-c\log\log(n)k^c}$.

Proof: We note that $z \le c < c\log(n) = k$, thus the bounds of Observation 39 hold. Let $f'$ be a random extension of $f$ to $m_{\max}$ terms. Then
$\Pr_{f\in_R F^{n,k,m}}\{|T_{a_s}| \ge 3m_{\max}k^z/n^z\} \le \Pr_{f'\in_R F^{n,k,m_{\max}}}\{|T_{a_s}| \ge 3m_{\max}k^z/n^z\} \le \Pr_{f'\in_R F^{n,k,m_{\max}}}\{|T_{a_s}| \ge (1+\epsilon)E\{|T_{a_s}|\}\} \le e^{-\mu\epsilon^2/4}$,
where $\mu = E\{|T_{a_s}|\}$, by Chernoff. We can obtain an upper bound for this expression from a lower bound for its unnegated exponent. Let $\epsilon = 2$; then
$\mu\epsilon^2/4 = \mu \ge m_{\max}k^z/2n^z = 2n^c c\log\log(n)\,k^z/2n^z \ge c\log\log(n)\,k^c$,
since $k^z/2n^z \ge k^c/2n^c$ (remember $z \le c \le k$). Therefore
$\Pr_{f\in_R F^{n,k,m}}\left\{|T_{a_s}| > 3m_{\max}\frac{k^z}{n^z}\right\} \le e^{-c\log\log(n)k^c}$.

B Proving Random Monotone Functions are Well Behaved with High Probability

In this section we prove that most monotone DNF in $F^{n,k,m}$ are well behaved. For the small $z$ property of being well behaved, bounding $|T_{a_s}|$ for $\#_1(a_s) \le c$, we first find the expected value of $|T_{a_s}|$; next we use Chernoff bounds to give an upper bound on how far $|T_{a_s}|$ can be from its expected value, with high probability.

Observation 39. For $s\subseteq X$, $a_s$ where $z = \#_1(a_s)$, and $|s| \le k \le \log^2(n)$,
$m\frac{k^z}{2n^z} < E_{f\in_R F^{n,k,m}}\{|T_{a_s}|\} < m\frac{k^z}{n^z}$.

Proof: We first observe that
$E_{f\in_R F^{n,k,m}}\{|T_{a_s}|\} = m\frac{\binom{n-|s|}{k-z}}{\binom{n}{k}} = m\frac{k^{\underline{z}}\,(n-k)^{\underline{|s|-z}}}{n^{\underline{|s|}}}$.
The upper bound follows by observing that
$\frac{(n-k)^{\underline{|s|-z}}}{n^{\underline{|s|}}} \le \frac{(n-z)^{\underline{|s|-z}}}{n^{\underline{|s|}}} = \frac{1}{n^{\underline{z}}}$,
and thus $E_{f\in_R F^{n,k,m}}\{|T_{a_s}|\} \le m\frac{k^z}{n^z}$. The lower bound follows from
$\frac{(n-k)^{\underline{|s|-z}}}{n^{\underline{|s|}}} = \frac{1}{n^{\underline{z}}}\prod_{i=0,\ldots,|s|-z-1}\frac{n-k-i}{n-z-i} \ge \frac{1}{n^{\underline{z}}}\left(1-\frac{k-z}{n-|s|}\right)^{|s|}$,
and $\left(1-\frac{k-z}{n-|s|}\right)^{|s|} > 1 - |s|\frac{k-z}{n-|s|} > \frac12$.

Corollary 42 (The Small $z$ Property Holds with High Probability).
Therefore, for $s \subseteq t \in f$ where $|s| \le \ell(n)+1$ and $z = \#_1(a_s)$,
$$\Pr_{f\in_R F_{n,k,m}}\left\{\exists a_s,\ 0 < \#_1(a_s) \le c,\ |T_{a_s}| > 3m_{max}\frac{k^z}{n^z}\right\} < \frac{1}{3\,n^{2c}\log(n)}.$$

Proof: We assume $c \ge 1$; if $c < 1$ then no such $z$ exists, since $0 < z \le c < 1$ and $z$ is an integer. If $c \ge 1$, then the number of pairs $(s, a_s)$ with $s \subseteq t \in f$ and $z = \#_1(a_s) \le c$ is bounded by
$$m\sum_{|s|=1,\dots,\ell(n)+1}\binom{k}{|s|}\sum_{z=1,\dots,c}\binom{|s|}{z} \le m_{max}\cdot\ell(n)\,k^{\ell(n)+1}\cdot2^{\ell(n)} \le \frac13 n^{c+1}$$
for sufficiently large $n$. (For a given term, the number of different sets of size $|s|$ is $\binom{k}{|s|}$; the number of terms is $m$; and for a given set $s$, the number of ways to choose the $z$ variables set to one is $\binom{|s|}{z}$.) Thus, by Lemma 41 and a union bound,
$$\Pr_{f\in_R F_{n,k,m}}\left\{\exists a_s,\ \#_1(a_s) \le c,\ |T_{a_s}| > 3m_{max}\frac{k^z}{n^z}\right\} < \frac13 n^{c+1}e^{-c\log\log(n)k^c} < \frac{1}{3\,n^{2c}\log(n)}$$
for sufficiently large $n$. Therefore, with high probability, the small $z$ property holds for $f \in_R F_{n,k,m}$.

Next we prove the medium $z$ property: for $a_s$ with $c < \#_1(a_s) < \ell(n)$, we have $|T_{a_s}| < \ell(n)$ with high probability.

Lemma 43. For fixed $c$ and sufficiently large $n$, let $s \subseteq X$ and $a_s$ with $\#_1(a_s) > c$. Then
$$\Pr_{f\in_R F_{n,k,m}}\{|T_{a_s}| \ge \ell(n)\} < \frac{1}{n^{\ell(n)-1}}.$$

Proof: Let $z = \#_1(a_s)$. If $z > k$ then $|T_{a_s}| = 0$, since there does not exist a term with more than $k$ variables. If $z \le k$, the probability that a random term $t$ satisfies $t \in T_{a_s}$ is $\frac{\binom{n-|s|}{k-z}}{\binom{n}{k}} \le \frac{k^z}{n^z}$. Consequently,
$$\Pr_{f\in_R F_{n,k,m}}\{|T_{a_s}| \ge \ell(n)\} \le \sum_{j=\ell(n)}^{m}\binom{m}{j}\left(\frac{k^z}{n^z}\right)^j\left(1-\frac{k^z}{n^z}\right)^{m-j} = \sum_{j=\ell(n)}^{\log(n)-1}\binom{m}{j}\left(\frac{k^z}{n^z}\right)^j\left(1-\frac{k^z}{n^z}\right)^{m-j} + \sum_{j=\log(n)}^{m}\binom{m}{j}\left(\frac{k^z}{n^z}\right)^j\left(1-\frac{k^z}{n^z}\right)^{m-j}$$
$$\le \log(n)\left(m\frac{k^{c+1}}{n^{c+1}}\right)^{\ell(n)} + m\left(m\frac{k^{c+1}}{n^{c+1}}\right)^{\log(n)} < \frac{1}{n^{\ell(n)-1}}.$$
The third inequality follows from $\binom{m}{j}x^j \le (mx)^j$, from the observation that the summands are maximized for $z = c+1$, and from Lemma 36; the fourth inequality follows from Observation 38.

Corollary 44 (The Medium z Property Holds with High Probability). For $s \subseteq t \in f$ where $|s| \le \ell(n)+1$ and sufficiently large $n$,
$$\Pr_{f\in_R F_{n,k,m}}\left\{\exists a_s,\ c < \#_1(a_s) < \ell(n),\ |T_{a_s}| \ge \ell(n)\right\} < \frac{1}{3\,n^{2c}\log(n)}.$$

Proof: The number of pairs $(s, a_s)$ with $s \subseteq t \in f$ and $c < \#_1(a_s) < \ell(n)$ is bounded by
$$m\sum_{|s|=c,\dots,\ell(n)+1}\binom{k}{|s|}\sum_{z=c+1,\dots,\ell(n)}\binom{|s|}{z} \le m_{max}\cdot\ell(n)\,k^{\ell(n)+1}\cdot2^{\ell(n)+1}.$$
(The sum $\sum_{|s|=c,\dots,\ell(n)+1}\binom{k}{|s|}$ counts the subsets of a term $t \in f$ of size greater than $c$ and at most $\ell(n)+1$; the sum $\sum_{z=c+1,\dots,\ell(n)}\binom{|s|}{z}$ counts the ways to choose a set of size $z$ from $|s|$ elements.) Therefore, using Lemma 43 and a union bound,
$$\Pr_{f\in_R F_{n,k,m}}\left\{\exists a_s,\ c < \#_1(a_s) < \ell(n),\ |T_{a_s}| \ge \ell(n)\right\} \le m_{max}\,\ell(n)\,k^{\ell(n)+1}2^{\ell(n)+1}\cdot\frac{1}{n^{\ell(n)-1}} < \frac{1}{3\,n^{2c}\log(n)}$$
for sufficiently large $n$. Consequently, the medium $z$ property is also satisfied by a random $f \in_R F_{n,k,m}$ with high probability.

The large $z$ property is that two terms in $f$ overlap in at most $\ell(n)$ variables; we prove this with a counting argument. Jackson and Servedio's paper [7] has a similar lemma (Lemma 3.5).

Lemma 45. Let $s, s' \subseteq X$ be sets of $k \le \sqrt[3]{n}$ variables chosen independently at random. Then
$$\Pr\{|s \cap s'| \ge \ell(n)\} < \left(\frac{1}{\sqrt[3]{n}}\right)^{\ell(n)-1}.$$

Proof:
$$\Pr\{|s \cap s'| \ge \ell(n)\} = \sum_{j=\ell(n)}^{k}\frac{\binom{k}{j}\binom{n-k}{k-j}}{\binom{n}{k}} = \sum_{j=\ell(n)}^{k}\frac{\binom{k}{j}^2\,j!\,(n-k)^{\underline{k-j}}}{n^{\underline{k}}} \le k\,\frac{\left(k^{\underline{\ell(n)}}\right)^2}{\ell(n)!\;n^{\underline{\ell(n)}}} < \left(\frac{1}{\sqrt[3]{n}}\right)^{\ell(n)-1}.$$
(The summand is maximized for $j = \ell(n)$ and there are at most $k$ summands; the last inequality uses $k \le \sqrt[3]{n}$.)

Corollary 46 (The Large z Property Holds with High Probability).
$$\Pr_{f\in_R F_{n,k,m}}\{\exists t, t' \in f,\ |t \cap t'| \ge \ell(n)\} < \frac{1}{3\,n^{2c}\log(n)}.$$

Proof: By Lemma 45 and a union bound over the $\binom{m}{2}$ pairs of terms,
$$\Pr_{f\in_R F_{n,k,m}}\{\exists t, t' \in f,\ |t \cap t'| \ge \ell(n)\} \le \binom{m}{2}\left(\frac{1}{\sqrt[3]{n}}\right)^{\ell(n)-1} \le \frac{m_{max}^2}{2}\cdot\frac{1}{n^{(\ell(n)-1)/3}} < \frac{1}{3\,n^{2c}\log(n)}$$
for sufficiently large $n$; the last inequality follows from Observation 33 and $m_{max} = 2c\log\log(n)n^c$, since $n^{(\ell(n)-1)/3} = n^{\log\log(n)-1/3}$ grows faster than any fixed polynomial in $n$. Since two terms of $f \in_R F_{n,k,m}$ share fewer than $\ell(n)$ variables with high probability, a random $f \in_R F_{n,k,m}$ satisfies the large $z$ property of being well behaved with high probability.
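The overlap bound of Lemma 45 is easy to sanity-check numerically. The sketch below is not part of the proof, and the parameters $n$, $k$, and the cutoff are illustrative choices of mine; it estimates $\Pr\{|s \cap s'| \ge \ell\}$ for two independent uniform $k$-subsets by Monte Carlo and compares it against the crude bound $k\,k^{2\ell}/(\ell!\,n^\ell)$ suggested by the proof.

```python
import math
import random

def overlap_tail_estimate(n, k, ell, trials=20000, seed=0):
    """Estimate Pr{|s ∩ s'| >= ell} for two independent uniform
    k-subsets s, s' of an n-element variable set (the event bounded
    in Lemma 45)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = set(rng.sample(range(n), k))
        s_prime = rng.sample(range(n), k)
        if sum(1 for x in s_prime if x in s) >= ell:
            hits += 1
    return hits / trials

# Illustrative parameters: n = 500 variables, terms of size k = 9,
# overlap cutoff ell = 4.  The crude union bound from the proof:
crude = 9 * 9 ** (2 * 4) / (math.factorial(4) * 500 ** 4)
p_hat = overlap_tail_estimate(500, 9, 4)
```

With these parameters the crude bound is already below $3\cdot 10^{-4}$, so the empirical frequency should come out essentially zero: large overlaps are exactly the rare event that the large $z$ property rules out.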
Recalling Corollaries 42, 44, and 46 and the definition of "well behaved," we conclude that $f \in_R F_{n,k,m}$ is well behaved with high probability.

C Bounding $|E^+_{1_t} \setminus E^+_{f\setminus\{t\}}|$

Next we present the proof of Lemma 28: for $f$ a well behaved monotone $k$-DNF function with $m \le 2^{k+1}c\log\log(n)$ terms and $t \in f$,
$$\left|E^+_{1_t} \setminus E^+_{f\setminus\{t\}}\right| \ge 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c}.$$

Proof: Divide $f \setminus \{t\}$ into three disjoint sets:
· $T_{disjoint} = \{t' \in f\setminus\{t\} \mid t' \cap t = \emptyset\}$,
· $T_{small} = \{t' \in f\setminus\{t\} \mid 1 \le |t' \cap t| \le c\}$, and
· $T_{not\,small} = f \setminus (T_{disjoint} \cup T_{small} \cup \{t\})$.
Looking only at examples in $E^+_{1_t}$, we now calculate the probability that each of these sets is not satisfied. Remembering that $f$ is monotone, we note that conditioning on one set being unsatisfied only increases the chance that another set is unsatisfied (i.e. $\Pr_{e\in E}\{\neg t'' \mid \neg t'\} \ge \Pr_{e\in E}\{\neg t''\}$, since knowing that at least one variable is set to zero increases the odds that another term is set to zero if they share a variable). So it suffices to bound the three probabilities separately and multiply.

In the first case, for $T_{disjoint}$,
$$\Pr_{e\in E^+_{1_t}}\{\forall t' \in T_{disjoint},\ \neg t'(e)\} \ge \left(1-\frac{1}{2^k}\right)^m \ge \left(1-\frac{1}{2^k}\right)^{2^{k+1}c\log\log(n)} \ge \frac{1}{\log^{4c}(n)},$$
by Observation 37.

In the second case, if $t' \in T_{small}$ and $r = t' \cap t$, then by $f$ being well behaved (the small $z$ property) the number of terms meeting $t$ exactly in $r$ is at most $3m_{max}\frac{k^{|r|}}{n^{|r|}}$. Each such term is falsified on $E^+_{1_t}$ with probability $1-\frac{1}{2^{k-|r|}}$, and (using $2^k = n^c$ and $m_{max} = 2c\log\log(n)n^c$) we have $3m_{max}\frac{k^{|r|}}{n^{|r|}} \le 2^{k-|r|}\cdot2^{|r|+3}c\log\log(n)\frac{k^{|r|}}{n^{|r|}}$. Therefore
$$\Pr_{e\in E^+_{1_t}}\{\forall t' \in T_{small},\ \neg t'(e)\} \ge \prod_{1\le|r|\le c}\left(\left(1-\frac{1}{2^{k-|r|}}\right)^{2^{k-|r|}}\right)^{2^{|r|+3}c\log\log(n)\frac{k^{|r|}}{n^{|r|}}\binom{k}{|r|}} \ge \left(\frac14\right)^{\frac{16c^2\log\log(n)k^2}{n}} > \frac14,$$
by Observation 37; the last inequality follows by noticing that the exponent is maximized for $|r| = 1$ and tends to zero as $n$ grows.

We now bound the third case. Since $f$ is well behaved, a term of $f$ overlaps another term in at most $\ell(n)$ variables (the large $z$ property), and the number of terms meeting $t$ in a given set $r$ with $c < |r| \le \ell(n)$ is at most $\ell(n)$ (the medium $z$ property). Therefore
$$\Pr_{e\in E^+_{1_t}}\{\forall t' \in T_{not\,small},\ \neg t'(e)\} \ge 1 - \sum_{r\subseteq t,\,c<|r|\le\ell(n)}\frac{\ell(n)}{2^{k-|r|}} \ge 1 - \frac{\ell(n)\,k\binom{k}{\ell(n)}2^{\ell(n)}}{2^k} > \frac12$$
for sufficiently large $n$ (since $|r| \le \ell(n)$, there are at most $k\binom{k}{\ell(n)}$ choices of $r$).

Therefore (remembering $2^k = n^c$),
$$\left|\left\{e \in E^+_{1_t} \,\middle|\, \forall t' \in f\setminus\{t\},\ \neg t'(e)\right\}\right| \ge \frac{1}{\log^{4c}(n)}\cdot\frac14\cdot\frac12\cdot\left|E^+_{1_t}\right| = \frac{1}{8\log^{4c}(n)}\cdot2^{n-k} = 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c}.$$

APPENDIX 2

In the next two sections we present the standard arguments for the sake of completeness. In Section A we prove that we can sample to find a sufficient approximation to $I_s(E^+)$. In Section B we prove that our very straightforward algorithm runs in polynomial time and produces $f$.

A Sampling and Approximating $I_s$

In Section 4, we proved that we could determine whether a set $s$ is a subset of a term when $c+2 \le |s| \le \ell(n)+1$ by computing $I_s$. Unfortunately, we cannot efficiently compute $I_s$, since we cannot efficiently compute $|E^+_{a_s}|$. Instead, we show how to approximate $I_s$ by sampling $g_s$ uniformly chosen labeled examples from $E$.

Definition 47. For $s \subseteq X$, let $E_{Sample(g_s)} \subseteq E$ be a random sample of $g_s$ labeled examples drawn uniformly from $E$.

Definition 48. Given $E_{Sample(g_s)}$, let $E^+_{Sample(g_s)} = E_{Sample(g_s)} \cap E^+$ be the set of positive examples in $E_{Sample(g_s)}$. Similarly, let $E^{1_s}_{Sample(g_s)} = E_{Sample(g_s)} \cap E^{1_s}$ be the set of positive examples from $E_{Sample(g_s)}$ which satisfy only terms in $T_{1_s}$.

Observation 49. For $s \subseteq X$,
$$E\left\{\left|E^{1_s}_{Sample(g_s)}\right|\right\} = \frac{g_s}{|E|}\left|E^{1_s}\right| = \frac{g_s}{2^n}\left|E^{1_s}\right|.$$

Using sampled labeled examples, we compute the following function to approximate $I_s$.

Definition 50. For $s \subseteq X$, we define
$$I_{s,Sample(g_s)} = \sum_{e\in E^+_{Sample(g_s)}}(-1)^{\#_0(a_s(e))}$$
to be our approximation of $I_s$, where $a_s(e)$ is $e|_s$.

Observation 51.
$$E\left\{I_{s,Sample(g_s)}\right\} = \frac{g_s}{2^n}I_s.$$
This observation follows since the expected value of $|E_{Sample(g_s)} \cap E^+_{a_s}|$ is $\frac{g_s}{|E|}|E^+_{a_s}|$, and thus
$$E\left\{I_{s,Sample(g_s)}\right\} = \sum_{a_s}(-1)^{\#_0(a_s)}\frac{g_s}{|E|}\left|E^+_{a_s}\right| = \frac{g_s}{2^n}I_s.$$

Next, we bound how far our sampled $I_{s,Sample(g_s)}$ can be from its expected value. As we have not yet stated a lower tail bound, we state one next, as described in Canny [3].

Lemma 52 (Chernoff). Let $\delta \in (0,1]$, let $\mu$ be the expected value, and let $X$ be the sum of a series of independent Poisson trials. Then
$$\Pr\{X < (1-\delta)\mu\} < e^{-\mu\delta^2/2}.$$

Applying the lower and upper Chernoff bounds from Lemmas 40 and 52, we prove that $I_{s,Sample(g_s)}$ is within $g_s\cdot\frac{2}{n^{c+1}}$ of $E\{I_{s,Sample(g_s)}\}$.

Lemma 53. For $g_s = n^{2c+3}2^{k+|s|}$ and given access to examples drawn from a well behaved monotone $k$-DNF,
$$\left|I_{s,Sample(g_s)} - E\left\{I_{s,Sample(g_s)}\right\}\right| < g_s\cdot\frac{2}{n^{c+1}}$$
with probability $1 - 4e^{-n/4}$.

Proof: To apply the Chernoff bounds, our main difficulty is that our sum has both positive and negative contributions; we overcome this difficulty by bounding the positive and negative contributions separately. We define two indicator functions: let $r_{even}(e) = 1$ iff $f(e) = 1$ and $\#_0(e|_s)$ is even, and let $r_{odd}(e) = 1$ iff $f(e) = 1$ and $\#_0(e|_s)$ is odd. Let $E_{Sample(g_s)}$ be a randomly generated set of $g_s$ examples from $E$, and let $X_{even} = \sum_{e\in E_{Sample(g_s)}}r_{even}(e)$ (similarly for $X_{odd}$). We observe that $I_{s,Sample(g_s)} = X_{even} - X_{odd}$.

If $a_s$ is such that $E^+_{a_s} \ne \emptyset$, then there is a term consistent with $a_s$; this term satisfies an example in $E_{a_s}$ with probability at least $\frac{1}{2^k}$. There are $2^{|s|}$ different $a_s$; thus if some $a_s$ with $\#_0(a_s)$ even has $E^+_{a_s} \ne \emptyset$, we expect at least a $\frac{1}{2^{|s|}2^k}$ fraction of the total examples to be set to one by $r_{even}$. Therefore, in $g_s = n^{2c+3}2^{k+|s|}$ examples, the expected value of the indicator function is either zero or at least $n^{2c+3}$. (Similarly for the case where $\#_0(a_s)$ is odd.)

Using the Chernoff bounds with $\delta = \frac{1}{n^{c+1}}$, in the cases where the expected value is not zero,
$$\Pr\left\{|X_{even} - E(X_{even})| > \delta E(X_{even})\right\} \le 2e^{-\frac{n^{2c+3}}{4n^{2c+2}}} = 2e^{-n/4}.$$
(Similarly for $E(X_{odd})$.) Consequently, each indicator function is within a $\frac{1}{n^{c+1}}$ fraction of its respective expected value, and therefore
$$\left|(X_{even} - X_{odd}) - E\left\{I_{s,Sample(g_s)}\right\}\right| \le \frac{1}{n^{c+1}}\left(E(X_{even}) + E(X_{odd})\right) \le g_s\cdot\frac{2}{n^{c+1}}.$$
Thus $I_{s,Sample(g_s)}$ differs from $E\{I_{s,Sample(g_s)}\}$ by at most $g_s\cdot\frac{2}{n^{c+1}}$ with high probability.
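Definition 50 and the even/odd split in the proof of Lemma 53 translate directly into code. The sketch below is my own illustration, not the paper's implementation: the toy formula, the names, and the sample size are assumptions, and the parameters are far from the paper's regime $g_s = n^{2c+3}2^{k+|s|}$. It draws uniform examples, labels them with a small monotone DNF, and returns $X_{even} - X_{odd} = I_{s,Sample(g)}$.

```python
import random

def I_s_sample(f_terms, s, g, n, seed=1):
    """Estimator I_{s,Sample(g)} of Definition 50: draw g uniform
    labeled examples and sum (-1)^{#0(e|_s)} over the positive ones.
    A term is a set of variable indices; an example is a list of n
    bits; f(e) = 1 iff some term has all its variables set to 1."""
    rng = random.Random(seed)
    x_even = x_odd = 0  # the split used in the proof of Lemma 53
    for _ in range(g):
        e = [rng.getrandbits(1) for _ in range(n)]
        if any(all(e[i] for i in t) for t in f_terms):  # f(e) = 1?
            zeros = sum(1 for i in s if e[i] == 0)      # #0(e|_s)
            if zeros % 2 == 0:
                x_even += 1
            else:
                x_odd += 1
    return x_even - x_odd  # I_{s,Sample(g)} = X_even - X_odd

# Toy illustration with f = x0x1x2 ∨ x3x4x5 over n = 8 variables:
# a set s inside one term gives a clearly positive signed sum,
# while a set s split across two terms gives a much smaller one.
f_terms = [{0, 1, 2}, {3, 4, 5}]
inside = I_s_sample(f_terms, {0, 1}, 20000, 8)
across = I_s_sample(f_terms, {0, 3}, 20000, 8)
```

For this toy formula one can compute $E\{I_{\{0,1\},Sample}\} = \frac{7}{64}g \approx 2188$ while $E\{I_{\{0,3\},Sample}\} = -\frac{1}{64}g \approx -313$; the gap between the two cases is what the threshold test of Lemma 54 exploits.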
Using the previous Lemma 53, Theorem 26, and Observation 49, we can determine whether $s \subseteq X$ is a subset of a term of a well behaved monotone $k$-DNF function by sampling labeled examples from the uniform distribution.

Lemma 54. Let $f$ be a well behaved monotone $k$-DNF formula, $s \subseteq X$ where $c+2 \le |s| \le \ell(n)+1$, and $g_s = n^{2c+3}2^{k+|s|}$.
· If $s \subseteq t \in f$ then $I_{s,Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}$ with probability $1 - 4e^{-n/4}$.
· If $s \not\subseteq t$ for every $t \in f$ then $I_{s,Sample(g_s)} < g_s\cdot\frac{5}{n^{c+1}}$ with probability $1 - 4e^{-n/4}$.

Proof: From Lemma 53 and Observation 51, we know that
$$-g_s\frac{2}{n^{c+1}} + \frac{g_s}{2^n}I_s \le I_{s,Sample(g_s)} \le g_s\frac{2}{n^{c+1}} + \frac{g_s}{2^n}I_s$$
with probability greater than $1 - 4e^{-n/4}$. By Theorem 30 we also know:
· If $s \subseteq t \in f$ then $I_s \ge 2^n\cdot\frac{1}{8\log^{4c}(n)\,n^c} - 2^n\cdot\frac{4k\log^6(n)}{n^{2/3}\,n^{c+1}}$. Thus
$$I_{s,Sample(g_s)} \ge -g_s\frac{2}{n^{c+1}} + g_s\left(\frac{1}{8\log^{4c}(n)\,n^c} - \frac{4k\log^6(n)}{n^{2/3}\,n^{c+1}}\right) > g_s\cdot\frac{5}{n^{c+1}}$$
with probability greater than $1 - 4e^{-n/4}$, for sufficiently large $n$.
· If $s \not\subseteq t$ for every $t \in f$ then $I_s \le 2^n\cdot\frac{4k\log^6(n)}{n^{2/3}\,n^{c+1}}$. Thus
$$I_{s,Sample(g_s)} \le g_s\frac{2}{n^{c+1}} + g_s\cdot\frac{4k\log^6(n)}{n^{2/3}\,n^{c+1}} < g_s\cdot\frac{5}{n^{c+1}}$$
with probability greater than $1 - 4e^{-n/4}$.

Algorithm Learn Random Monotone DNF
1. S = Distinguishing Subsets
2. f' = ∅
3. For s ∈ S:
   (a) t = ∅
   (b) For x ∈ X: if $I_{s\cup\{x\},Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}$ then add x to t
   (c) Add t to f'
4. Return f'

Figure 3: Algorithm Learn Random Monotone DNF.

Function Distinguishing Subsets
1. S = $\{s \subseteq X \mid |s| = c+2$ and $I_{s,Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}\}$
2. For i = (c+3) to $\ell(n)$:
   (a) S' = ∅
   (b) For s ∈ S and x ∈ X: if $I_{s\cup\{x\},Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}$ then add $s\cup\{x\}$ to S'
   (c) S = S'
3. Return S

Figure 4: Function Distinguishing Subsets.

B Learning Random Monotone DNF by Finding Terms in Polynomial Time

Next, we restate our algorithm to use $I_{s,Sample(g_s)}$.
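The loop structure of Figures 3 and 4 can be sketched as follows. This is a schematic of mine, not the paper's implementation: the statistical test "$I_{s,Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}$" of Lemma 54 is abstracted into a black-box predicate `is_sub(s)` answering whether $s$ is contained in some term, and `start_size` and `max_size` play the roles of $c+2$ and $\ell(n)$.

```python
from itertools import combinations

def learn_terms(n, is_sub, start_size, max_size):
    """Schematic of Figures 3 and 4: find all distinguishing subsets
    of size start_size, grow them one variable at a time up to size
    max_size, then extend each survivor to a full term."""
    # Figure 4, step 1: all subsets of size start_size passing the test.
    S = {frozenset(s) for s in combinations(range(n), start_size)
         if is_sub(frozenset(s))}
    # Figure 4, step 2: grow each subset by one variable per round.
    for _ in range(start_size, max_size):
        S = {s | {x} for s in S for x in range(n)
             if x not in s and is_sub(s | {x})}
    # Figure 3, step 3: a subset of size max_size lies in a unique
    # term, so collecting every x with is_sub(s ∪ {x}) recovers it.
    return {frozenset(x for x in range(n) if is_sub(s | {x})) for s in S}

# Toy run with an exact subset oracle for f = x0x1x2x3 ∨ x2x3x4x5;
# the terms overlap in only 2 < max_size variables, mirroring the
# large z property that makes the final extension step unambiguous.
terms = [frozenset({0, 1, 2, 3}), frozenset({2, 3, 4, 5})]
oracle = lambda s: any(s <= t for t in terms)
learned = learn_terms(8, oracle, 2, 3)  # learned == set(terms)
```

In the algorithm proper, each oracle call is a fresh draw of $g_s$ labeled examples followed by the threshold comparison, which is what makes the sample sizes of Lemmas 56-58 the dominant cost.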
Referring to our algorithm in Figures 3 and 4 and to the lemmas and theorems of the previous sections, we prove that our algorithm discovers the unknown well behaved monotone $k$-DNF from random examples drawn from the uniform distribution, with high probability, in polynomial time. We show this by following the steps our algorithm takes. First, our algorithm finds all $(c+2)$-sized subsets of terms in time $O(g_{c+2}n^{c+2})$ with probability greater than $1 - 4n^{c+2}e^{-n/4}$. Next, given all $(c+2)$-sized subsets of terms of $f$, our algorithm grows those subsets until they have size $\ell(n)$, with probability greater than $1 - 4nmk^{\ell(n)}e^{-n/4}$, in time $O(nmg_{\ell(n)}k^{\ell(n)})$. Finally, given the subsets of terms of size $\ell(n)$, our algorithm discovers all the variables of each term in time $O(mng_{\ell(n)+1}k^{\ell(n)})$ with probability at least $1 - 4mnk^{\ell(n)}e^{-n/4}$.

Observation 55. For $s \subseteq X$, computing $I_{s,Sample(g_s)}$ takes time $O(g_s)$.

In step 1, our algorithm finds all the $(c+2)$-sized subsets of terms of $f$.

Lemma 56. Given a well behaved $f \in F_{n,k,m}$, our function Distinguishing Subsets finds $\{s \mid s \subseteq t \in f,\ |s| = c+2\}$ in step 1, in time $O(g_{c+2}n^{c+2})$, with probability greater than $1 - 4n^{c+2}e^{-n/4}$.

Proof: Let $s \subseteq X$ where $|s| = c+2$. By Lemma 54, $s \subseteq t \in f$ iff $I_{s,Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}$, with probability greater than $1 - 4e^{-n/4}$. Function Distinguishing Subsets tests all $\binom{n}{c+2} \le n^{c+2}$ subsets of size $c+2$; thus our function correctly selects exactly the sets which are subsets of terms of $f$ with probability greater than $1 - 4n^{c+2}e^{-n/4}$, in time $O(g_{c+2}n^{c+2})$.

Having found all subsets of $t \in f$ of size $c+2$ with high probability, our algorithm grows these sets until all the subsets of terms have size $\ell(n)$.

Lemma 57. Given a well behaved $f \in F_{n,k,m}$ and $T = \{s \mid s \subseteq t \in f,\ |s| = c+2\}$, function Distinguishing Subsets in step 2 returns $\{s \mid s \subseteq t \in f,\ |s| = \ell(n)\}$ with probability greater than $1 - 4mnk^{\ell(n)}e^{-n/4}$, in time bounded by $O(nmg_{\ell(n)}k^{\ell(n)})$.
Proof: Using the result of Lemma 54, each iteration of our loop is given a set $S = \{s \mid s \subseteq t \in f,\ |s| = i\}$ and produces $S' = \{s \mid s \subseteq t \in f,\ |s| = i+1\}$ with probability more than $1 - 4nmk^ie^{-n/4}$, for $i = c+2,\dots,\ell(n)-1$, in time bounded by $O(g_inmk^i)$ (each $s \in S$ is a subset of a term, so $|S| \le m\binom{k}{i} \le mk^i$, and each $s$ is tested against all $n$ variables). Thus, in $\ell(n)-(c+2)$ iterations, our algorithm produces $S = \{s \mid s \subseteq t \in f,\ |s| = \ell(n)\}$ with probability greater than $1 - 4\ell(n)nmk^{\ell(n)-1}e^{-n/4}$, in time bounded by $O(nmg_{\ell(n)}k^{\ell(n)-1})$.

Given all $\ell(n)$-sized subsets of terms of $f$, algorithm Learn Random Monotone DNF finds all the terms of $f$.

Lemma 58. Given a well behaved $f \in F_{n,k,m}$ and $S = \{s \mid s \subseteq t \in f,\ |s| = \ell(n)\}$, our algorithm Learn Random Monotone DNF finds $f$ in step 3, in time bounded by $O(mng_{\ell(n)+1}k^{\ell(n)})$, with probability greater than $1 - 4nmk^{\ell(n)}e^{-n/4}$.

Proof: Algorithm Learn Random Monotone DNF uses Corollary 46 and Lemma 54. Corollary 46 states that, for a well behaved monotone $k$-DNF, if $s \subseteq X$ with $|s| \ge \ell(n)$ then $|\{t \mid s \subseteq t \in f\}| \le 1$; thus every $s \in S$ is associated with at most one term $t \in f$. Lemma 54 states that, for a given $s \subseteq X$ and $x \in X$ where $|s \cup \{x\}| \le \ell(n)+1$, we have $I_{s\cup\{x\},Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}$ iff $s \cup \{x\} \subseteq t \in f$, with probability at least $1 - 4e^{-n/4}$. Thus, for $|s| = \ell(n)$ with its unique term $t \supseteq s$, we can determine whether $x \in t$ with high probability.

Combining these ideas, given $s \in S$ we can find a term in the inside loop of step 3 by testing every $x \in X$ to determine whether $s \cup \{x\} \subseteq t \in f$, and thus find $\{x \mid I_{s\cup\{x\},Sample(g_s)} > g_s\cdot\frac{5}{n^{c+1}}\} = t \in f$ in time $O(g_{\ell(n)+1}n)$ with probability greater than $1 - 4ne^{-n/4}$. Together, the outside loop in step 3 selects every $s \in S$, and the inside loop finds the term $t$ with $s \subseteq t$. Since for every $t \in f$ there exists $s \in S$ such that $s \subseteq t$, algorithm Learn Random Monotone DNF produces $f$. The total time is bounded by $O(g_{\ell(n)+1}nmk^{\ell(n)})$, with probability greater than $1 - 4nmk^{\ell(n)}e^{-n/4}$.

Theorem 59. Given a well behaved $f \in F_{n,k,m}$, algorithm Learn Random Monotone DNF finds $f$ in time bounded by $O(mng_{\ell(n)+1}k^{\ell(n)})$ with probability greater than $1 - 9mnk^{\ell(n)}e^{-n/4}$.
Proof: By Lemmas 56, 57, and 58: our algorithm finds all subsets of terms of $f$ of size $c+2$ (Lemma 56); having found these subsets, it builds on them until it has found all subsets of terms of $f$ of size $\ell(n)$ (Lemma 57); it then uses the uniqueness of the term containing each such subset of size $\ell(n)$ to find all the variables of that term (Lemma 58), thus finding the entire function. The algorithm runs in time bounded by $O(mng_{\ell(n)+1}k^{\ell(n)})$ with probability greater than $1 - 9mnk^{\ell(n)}e^{-n/4}$.