Learning Hierarchical Rie Independent Groupings from Rankings Jonathan Huang Carlos Guestrin Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA, 15213 jch1@cs.cmu.edu guestrin@cs.cmu.edu Abstract Ried independence is a generalized notion of probabilistic independence that has been shown to be naturally applicable to ranked data. In the ried independence model, then in a second as if by one assigns rankings to two disjoint sets of items independently, stage, interleaves (or ries) the two rankings together to form a full ranking, shuing a deck of cards. Because of this A popular and highly successful approach for achieving such simplicity for distributions involving large collections of interdependent variables has been to exploit conditional independence structures (e.g., with naive Bayes, tree, or Markov models). With ranking problems, however, independence-based relations are typically trickier to exploit due to the so-called exclusivity mutual constraints which constrain any two items to map to dierent ranks in any given ranking. In a recent paper, Huang and Guestrin (2009) proposed an alternative notion of independence, called interleaving stage, it is much more dicult to detect ried independence than ordinary independence. In this paper, we provide the rst automated method for discovering sets of items which are rie independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on ried independence. pendence, ried inde- which they have shown to be more natural In addition to being a natural way to for rankings. represent distributions in a factored form, the concept of ried independence leads to a natural way of grouping items together; If we nd two subsets of items, and A B to be rie independent of each other, then it is often the case that items in A (and items in B ) are as- sociated with each other in some way. Experimentally, Huang and Guestrin (2009) were able to demonstrate on a particular election dataset (of the American Psychological Association, (Diaconis, 1989)), that the political coalitions formed by candidates in the election were in fact approximately rie independent of each other. However, it is not always obvious what kind of groupings ried independence will lead to. Should fruits really be rie independent of vegetables? Or are green foods rie independent of red foods? In this paper, we address the problem of automatically discovering optimal rie independent groupings of items from training data. Key among our observations is the fact that while item ranks cannot be independent due to mutual exclusivity, relative ranks between sets of items are not subject to the same constraints. More than simply being a `clustering' algorithm, however, our procedure can be thought of as a structure learning algorithm, like those from the graphical models literature, which nd the optimal (ried) independence decomposition of a distribution. Structure learning algorithms are useful for ecient probabilistic representations and inference, and in the spirit of graphical models, we explore a family of probabilistic models for 1. Introduction Ranked data appears ubiquitously in various machine learning application domains. for in example, surveys, in reasoning results search in Rankings are useful, preference lists The information retrieval about applications, and ballots in certain elections. problem of building statistical models on rankings has thus been an important research topic in the learning community. As with many challenging learning problems, one must contend with an intractably large state space when dealing with rankings since there are n! ways to rank n objects. In building a statistical model over rankings, simple (yet exible) models are therefore preferable because they are typically more computationally tractable, less prone to overtting, and often more interpretable. Appearing in Proceedings of the 27 th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s). Learning Hierarchical Rie Independent Groupings from Rankings rankings based on hierarchies of groupings using ried independence. We show that our hierarchical models are simple and interpretable, and present a method for performing model selection based on our algorithm. Our main contributions are as follows: makes probabilistic independence a tricky notion for rankings, since each pair of items is constrained to map to dierent ranks. Another way to see this is to imagine constructing a graphical model representing h in which each node corresponds to the rank of For each pair of nodes, mutual exclusivity · We propose a method for nding the partitioning of an item set such that the subsets of the partition are as close to rie independent as possible. In particular, we propose a novel objective for quantifying the degree to which two subsets are rie independent to each other. an item. induces an edge between those two nodes, leading to a fully connected graph (Huang et al., 2009). Ried independence, introduced recently in (Huang & Guestrin, 2009), turns out to be a more natural notion of independence in the ranking setting. vegetables (denoted by set set Consider our example of jointly ranking a set containing both · We dene a family of simple and interpretable distributions over rankings, called hierarchical rife independent models, in which subsets of items are interleaved into larger subsets in a recursive stagewise fashion. We apply our partitioning algorithm to perform model selection from training data in polynomial time, without having to exhaustively search over the exponentially large space of hierarchical structures. A) and fruits (denoted by The idea of ried in- B) in order of preference. dependence is to rst draw two independent rankings (one for vegetables, one for fruits), then to interleave the two sets to form a full ranking for the entire collection. The word rie technique called the rie shue comes from the popular shuing in which one cuts a deck of cards and interleaves the piles. For a formal denition, we establish the following notation about 2. Ried independence for rankings In this paper we will be concerned with distributions over rankings. where rank A relative rankings Sn , and interleavings. Given a ranking ranking = ((1), . . . , (n)) n items j th item is a one-to-one association between and ranks, (j) = i means that the is assigned i under . We will also refer to a ranking by its -1 inverse, (1), . . . , -1 (n) (called an ordering and -1 (i) = j also assigned rank i under . means that the denoted with double brackets instead of parentheses), where j th item is As a running example ( C) , for in this paper, we will consider ranking a small list of 6 items consisting of fruits and vegetables: and (G) Grapes. The ordering and so on. A distribution Corn, (P) Peas, (L) Lemons, (O) Oranges, (F) Figs, = P, F, C, . . . example, ranks Peas rst, Figs second, Corn third A {1, . . . , n}, let A () denote the ranks of items in A relative to the set A. For example, in the ranking = P, L, F, G, C, O , the relative ranks of the vegetables is A () = P, C = P eas, Corn . Thus, while corn is ranked fth in , it is ranked second in A (). Similarly, the relative ranks of the fruits is B () = L, F, G, O = Lemons, F igs, Grapes, Oranges . Likewise, let A,B () denote the way in which the sets A and B are interleaved by . For example, using the same as above, the interleaving of vegetables and fruits is A,B () = V eg, F ruit, F ruit, F ruit, V eg, F ruit . We will use A,B to denote the collection of all distinct ways of interleaving the sets A and B . Note n that if |A| = k , then |A,B | = k . and a subset h(), dened over the Sn , can be viewed as a joint distribution over the n variables ((1), . . . , (n)) (where (j) {1, . . . , n}), subject to mutual exclusivity set of rankings, Denition 1 (Huang and Guestrin (2009)). Let constraints which stipulate that two objects cannot h be a distribution over Sn and consider a subset A {1, . . . , n} (with |A| = k )and its complement B . The sets A and B are said to be rie independent if h factors as: h() = m(A,B ()) · fA (A ()) · gB (B ()), where and simultaneously map to the same rank, or alternatively, that two ranks cannot simultaneously be occupied by the same object (h((i) Since there are (2.1) = (j)) = 0 n whenever i = j ). n, and n! rankings for items, representing arbitrary distributions of distributions. probabilistic It h is not viable for large be tempting to it is typical to restrict oneself to simplied classes might think would that independence assertions m, fA , gB are distributions over A,B , Sk , Sn-k , respectively. We will refer to m as the interleaving distribution, and to fA and gB as the relative ranking distributions of A and B . We will also notate the relation as A m B . Ried independence can be seen as a generalization of full independence (when the interleaving distribution is deterministic). In addition to exploring theoretical properties, Huang and Guestrin showed be a useful simplication for ranking problems (as naive Bayes models have been useful in other areas of machine learning). However, mutual exclusivity Learning Hierarchical Rie Independent Groupings from Rankings { , , , , , } { , } CPLOFG { , } Citrus Vegetables CP { , , , } Fruits LOFG LO Mediterranean { , } FG CPLOFG {C,P,L,O} {F,G} {C,P} {L,O} { , , , , , } (b) Another example, not equivalent to (a) {A,B,C,D} {A,B,C} {D} {A,B} {C} {A} {B} (a) Example of hierarchical ried independence structure on S6 Figure 1. (c) Hierarchical decomposition into singleton subsets (each leaf set consists of a single item) Examples of distinct hierarchical rie independent structures sets are interleaved to form a ranking of all fruits that the ried independence relation in fact holds (approximately) in certain datasets. In particular, they showed on a particular election dataset, that the political coalitions in the candidate set were nearly rie independent of each other. In other nonpolitical ranking datasets, however, it is not always obvious how one should partition the item set so that the rie independent approximation is as close as possible to the true joint distribution. Is there any way to Before presenting discover latent coalitions of items arising from ried independence from ranked data? our solution to this question, we discuss a more G, L, O, F ). Finally, a ranking of the vegetables is P, C ) and interleaved with the fruit rankings to form a full joint ranking: P, G, L, O, F, C . ( drawn ( Notationally, we can express the hierarchical decomposition as {P, C} m ({L, O} m {F, G}). We can also visualize hierarchies using trees (see Figure 1(a) for our example). The subsets of items which appear as leaves in the tree will be referred to as A natural question to ask is: leaf sets. if we used a dierent hierarchy with the same leaf sets, would we capture the same distributions? For example, does a distribution which decomposes according to the tree in Figure 1(b) also decompose according to the tree in Figure 1(a)? The answer, in general, is no, due to the fact that distinct hierarchies impose dierent sets of independence assumptions, and as a result, dierent structures can be well or badly suited for modeling a given dataset. Consequently, it is important to use the correct structure if possible. We remark that while the two structures 1(a),1(b), capture distinct families of distributions, it is possible to identify a set of independence assumptions common to both structures. complicated but realistic scenario in which coalitions are further partitioned into smaller subgroups. 3. Hierarchical decompositions A (and its complement B ), there are O k! + (n - k)! + n model parameters (specifying k fA , gB , and m). Despite the fact that rie indeFor a xed subset pendent models are much more compact compared to full models over rankings, it is still intractable to represent arbitrary rie independent models for large n, and one must make further simplications. For example, Huang and Guestrin discuss methods for representing useful parametric families of interleaving distributions, with just a handful of parameters. We will explore the other natural model simplication which comes from the simple observation that the relative ranking distributions 4. Objective functions Since dierent hierarchies impose dierent independence assumptions, we would like to nd the structure that is best suited for modeling a given ranking dataset. The base problem that we now address is how to nd the best structure if there is only one level of partitioning and two leaf sets, we want to nd the topmost fA and gB are again distributions over rankings and so the sets subsets. We call such models A and B can further be decomposed into two rie independent pendent decompositions. hierarchical rie inde- A, B . Alternatively, of the partitioning Continuing with our running tree. In Section 5, we use this base case as part of a top-down approach for learning a full hierarchy. example, one can imagine that the fruits are further partitioned into two sets, a set consisting of citrus fruits ((L) Lemons and (O) Oranges) and a set consisting of mediterranean fruits ((F) Figs and (G) Grapes). To generate a full ranking, one rst draws rankings of the citrus and mediterranean fruits independently ( Problem statement. ings, Given a training set of rank- L, O and G, F , for example). Secondly, the two (1) , . . . , (m) h, in which a subset of items, A {1, . . . , n}, is rie independent of its complement, we would like to automatically determine the sets A and B {1, . . . , n}. If h does not exactly factor rie Learning Hierarchical Rie Independent Groupings from Rankings independently, then we would like to nd the rie independent approximation which is Internal Triplet closest to h in some sense. Formally, we would like to solve the problem: ^ arg min min DKL (h() || m(A,B ())f (A ())g(B ())), A m,f,g (4.1) where A Cross Triplet B ^ h is the empirical distribution of training examples and DKL is the Kullback-Leibler divergence measure. Equation 4.1 is a seemingly reasonable objective since it can also be interpreted as maximizing the likelihood of the training data. If A and B are Graphical depiction of the problem of nding rife independent subsets. Note the similarities to clustering. Double bars represent the direct nature of the triangles. Figure 2. truly rie independent, then 4.1 can be shown via the Gibbs inequality to attain its minimum, zero, at (and only at) the subsets in which we replace the mutual informations dened over all Given A and B. n variables by a sum of mutual informatriplet of distinct items, tions dened over just three variables at a time. any For small problems, one can actually solve Problem 4.1 using a single computer by evaluating the approximation quality of each subset 2009). Ii;j,k I((i) ; (j) < (k)). We dene A and taking the minimum, the set of triplets which cross from set (i, j, k), let cross to be A,B A to set B : int A which was the approach taken in (Huang & Guestrin, However, for larger problems, one runs into time and sample complexity problems since optimizing the globally dened objective function (4.1) requires relearning all model parameters (m, cross {(i; j, k) : i A, j, k B}. A,B cross B,A is similarly dened. We also dene to be the set of triplets that are internal to A fA , and each of the exponentially many subsets of gB ) for {1, . . . , n}. and again, int {(i; j, k) : i, j, k A}, A int B is similarly dened. Our proxy objective function can be written as the sum of the mutual information evaluated over all of the crossing edges: In the remainder, we propose a more locally dened objective function, reminiscent of clustering, which we will use instead of Equation 4.1. As we show, our new objective will be more tractable to compute and have lower sample complexity for estimation. ~ F(A) X (i,j,k)cross A,B Ii;j,k + X (i,j,k)cross B,A Ii;j,k . (4.3) Proposed objective function. the observation that The approach we ~ F can be viewed as a low order version of information computations over F , involving triplets of The mutual absolute ranks of items in A are fully independent of relative ranks of items in B , and vice versa. With our vegetables and fruits, for example, knowing that Figs is ranked rst among all six items (the absolute rank of a fruit) should give no information about whether Corn is preferred to Peas (the relative rank of vegetables). More formally, given a subset (thus, take is to minimize a dierent measure that exploits mutual variables at a time instead of information pare. If that and n-tuples. Ia;b,b , for example, reects how much the rank of a vegetable tells us about how two fruits com- A and B are rie independent, then we know Ia;b,b = 0 and Ib;a,a = 0 for any items a, a A ~ b, b B . The objective F is somewhat remi- A = {a1 , . . . , a }, let (A) denote the niscent of typical graphcut and clustering objectives. Instead of partitioning a set of nodes based on sums of pairwise similarities, we partition based on sums of tripletwise anities. We show a graphical depiction of the problem in Figure 2, where cross triplets (in vector of (absolute) ranks assigned to items in A by (A) = ((a1 ), (a2 ), . . . , (a ))). We propose to minimize an alternative objective function: F(A) I((A) ; B ()) + I((B) ; A ()), where (4.2) (dened cross , cross ) have low weight and internal triplets A,B B,A int int (in A , B ) have high weight. The objective is to nd a partition such that the sum over cross triplets is low. In fact, the problem of optimizing I denotes the mutual information by X1 and X2 DKL (P (X1 , X2 )||P (X1 )P (X2 )). We between two variables I(X1 ; X2 ) can show that F ~ F can be seen as an instance of the weighted, directed hypergraph cut problem (Gallo et al., 1993). Note that the word is guaranteed to detect ried independence: F(A) = 0 is a necessary and sucient criterion for a subset A {1, . . . , n} to be rie independent of its complement, B . Proposition 2. As with Equation 4.1, optimizing for large directed is signicant for us, because, unlike typical clustering problems, our triplets are not symmetric (for example, Ii;jk = Ij;ik ), resulting in a nonstandard and poorly understood optimization problem. F is still intractable Third-order detectability assumptions. does When n. However, it motivates a natural proxy, ~ F detect ried independence? It is not dicult Learning Hierarchical Rie Independent Groupings from Rankings to see, for example, that ~ F = 0 is a necessary con- uniqueness requirement is satised consists of the distributions for which dition for ried independence, since A m B implies A and B are strongly connected Ia;b,b = 0. We have: according to the following denition. Proposition 3. If A and B are rie independent ~ sets, then F(A) = 0. However, the converse of Prop. 3 is not true in full generality without accounting for dependencies that involve larger subsets of variables. Just as the pairwise independence assumptions (that are commonly used for randomized algorithms (Motwani & Raghavan, 1996)) rie 1 Denition 5. A subset -third-order strongly connected with A {1, . . . , n} is called . if, for every triplet i, j, k A If a set are i, j, k distinct, we have Ii;j,k > A is rie independent of order strongly B and both sets then we can third connected, ensure that ried independence is detectable from third-order terms and that the partition is unique. We have the following probabilistic guarantee. do not imply full independence between two from tripletwise marginals but sets of variables, there exist distributions which look independent do not factor upon examining higher-order terms. Nonetheless, in most practical scenarios, we expect Theorem 6. ~ F =0 to imply ried independence. Let A and B be -third order strongly connected rie independent sets, and suppose |A| = k. 2 4 4 Given S(, ) O n2 log2 n log n i.i.d. samples, Estimating argued that ~ F ~ F from samples. We have so far is a reasonable function for nding rie However, since we only have ^ the minimum of F is achieved at exactly the subsets A and B with probability at least 1 - . If independent subsets. k access to samples rather than the true distribution imation to the objective triplet of items, drawn from version of h is small compared to itself, it will only be possible to compute an approx- rive a better bound: n, then S(, ) O one can actually de- n2 2 log2 n2 log n4 . ~ F. In particular, for every from i.i.d. samples (i, j, k), we must compute an estimate Finding balanced partitions. jective function We conclude this of the mutual information Ii;j,k section with a practical extension of the basic ob- h, and the main question is: how many ~ F. In practice, like the minimum the tripletwise objective small to more balthe fact that samples will we need in order for the approximate cut objective for graphs, partitions (either anced partitions between ~ F to remain a reasonable objective function? For each triplet, we use a regularized of Equation 4.3 has a tendency to prefer In the following, we denote the estimated value of Ii;j,k by ^ Ii;j,k . |A| or |B| very small) (|A|, |B| n/2) due to unbalanced partitions have fewer triplets that cross procedure due to (Högen, 1993) to estimate mutual information. We adapt his sample complexity bound to our problem below. A and B. The simplest way to avoid this bias is to optimize the objective function over subsets of a xed size k. As we discuss in the next section, Lemma 4. For any xed triplet (i, j, k), the mutual information Ii;j,k can be estimated to within an accuracy of with probability at least 1 - using 4 n n2 i.i.d. samples and S(, ) O 2 log2 log n the same amount of time. The approximate objective function is therefore: optimizing with a xed k can be useful for building thin hierarchical rie independent models. Alternatively, one can use a modied objective function that encourages more balanced partitions. For example, we use a variation of our objective function (inspired by the normalized cut criterion (Shi & Malik, 2000)) which has proven to be useful for detecting ried independence when the size k is unknown. ^ F(A) X (i,j,k)cross A,B ^ Ii;j,k + X (i,j,k)cross B,A ^ Ii;j,k . 5. Structure discovery algorithms. Having now designed a function that is tractable to estimate from both perspectives of computational and sample complexity, we turn to the problem of learning the hierarchical rie independence structure from training examples. In this paper, we take a simple top-down approach in which the item sets are recursively partitioned by optimizing A What we want to now show is that, if there exists a unique way to partition our approximation {1, . . . , n} into rie inde- pendent sets, then given enough training examples, ^ F uniquely singles out the correct partition as its minimum with high probability. class of rie independent distributions for which the A pairwise independent family of random variables is one in which any two members are marginally independent. Subsets with larger than two members may not necessarily factor independently, however. 1 ^ F until some stopping criterion is met. The question is: can ^ F be optimized eciently? We begin by discussing a restricted class of thin models for which we can tractably optimize ^ F. Learning Hierarchical Rie Independent Groupings from Rankings Learning thin chain models. By a k -thin chain Algorithm 1 The Anchors method (1) (m) i;j,k model, we refer to a hierarchical structure in which the size of the smaller set at each split in the hierarchy is xed to be a constant be expressed as: k O(1) and can therefore Input: training set { , . . . , }, k |A|. ^ Estimate and cache I for all i, j, k; for all a , a {1, . . . , n}, a = a do 1 2 th 1 2 best a1 ,a2 (A1 m (A2 m (A3 m . . . ))), |Ai | = k, for all i. We view thin chains as being somewhat analogous to thin junction tree models (Chechetka & Guestrin, 2007), in which cliques are never allowed to have more than thin chain end for A arg min output A ; best ^ ^ I k k smallest item in {Ix;a1 ,a2 ; x = a1 , a2 }; ^ ^x;a1 ,a2 I k } ; Aa1 ,a2 {x : I ^ F(Aa1 ,a2 ); k variables. one To draw rankings from a sequentially inserts items strong connectivity as in the previous section, one can use similar arguments to derive sample complexity bounds. We remark that there are practical dierences that can at times make the anchors method somewhat less robust than an exhaustive search. Conceptually, anchoring works well when there exists two elements that are strongly connected with all of the other elements in its set, whereas an exhaustive search can work well in weaker conditions such as when items are strongly connected through longer paths. We show in our experiments that the anchors method can nonetheless be quite eective for learning hierarchies. model, independently, one group of size the full ranking. k at a time, into The structure learning problem is therefore to discover how the items are partitioned into groups, which group is inserted rst, which group is inserted second, and so on. When the size over all k is known, the optimal partitioning of Finding the global optimum of an item set can be found by exhaustively evaluating k -subsets. ^ F ^ F is therefore guaranteed at each stage of the recursion. The time complexity for nding the optimal partition from m samples is O(knk+2 +mn3 ), accounting for prek is small. computing the necessary mutual informations as well as optimization time, and is tractable if 6. Experiments Synthetic data. algorithms We rst applied our methods to synthetic data to show that, given enough samples, our Handling arbitrary partitions using anchors. When k is large, or even unknown, ^ F cannot be opti- mized using exhaustive methods. Instead, we propose a simple algorithm for nding two elements of the set do eectively recover the optimal hierarchin, we simulated A and B based on the cal structures. For various settings of data drawn jointly from a following observation. If an oracle could identify any k -thin chain model (for A, say, a1 , a2 , in advance, then the quantity Ix;a1 ,a2 = I(x; a1 < a2 ) indicates whether the item x belongs to A or B since Ix;a1 ,a2 is nonzero in the rst case, and zero in the second case. For nite training sets, when and if k = 4) with a random parameter setting for each structure and applied our exact method for learning thin chains to each sampled dataset. First, we investigated the eect of varying sample size on the proportion of trials (out of fty) for which our algorithms were able to (a) recover the full underlying tree structure I k is only known approx- imately, one can sort the set {Ix;a1 ,a2 ; x = a1 , a2 } items closest to zero exactly, of size (b) recover the topmost partition correctly, k is known, take the or (c) recover all leaf sets correctly (but possibly out of order). Figure 3(a) shows the result for an itemset to be the set B. Since we compare all items against are not known in advance, but the a1 , a2 , we refer to these two xed items as anchors. n = 16. Figure 3(b), shows, as a function of n, Of course a1 , a2 the number of samples that were required in the same experiments to (a) above method can be repeated for every pair of items as anchors to produce a collection of partitions. the approximate objective O(n2 ) exactly recover the full underlying candidate structure or (b) recover the correct leaf sets, for at least 90% of the trials. What we can observe from the plots is that, given enough samples, reliable structure recovery Each partition can then be scored using ^ F, and a nal optimal partition can be selected as the minimum over the candidates. See Algorithm 1. In cases when settings of is indeed possible. It is also interesting to k is not note that recovery of the correct leaf sets can be done with much fewer samples than are required for recovering the full hierarchical structure of the model. After learning a structure for each dataset, we learned model parameters and evaluated the log-likelihood of each model on 200 test examples drawn from the true distributions. In Figure 3(c), we compare log- known a priori, we evaluate partitions for all possible k using ^ F. can be signicantly faster (with Since the anchors method does not require searching over subsets, it O(n3 m) time complexity) than an exhaustive opti- mization of ^ F. Moreover, by assuming -third order Learning Hierarchical Rie Independent Groupings from Rankings log10(# samples required to recover tree) 1 all leaf sets recovered 5 top partition recovered -5400 1 leaf set recovery success rate true structure 0.8 Percentage of trials -5600 log-likelihood Percentage of trials 0.6 top partition recovered (green, dashed) 4 full tree recovered -5500 0.8 all leaf sets recovered full tree recovered (red, solid) 3 -5700 -5800 0.6 0.4 2 all leaf sets recovered 1 -5900 -6000 -6100 -6200 random 1-chain (with learned parameters) learned structure 0.4 0.2 full tree recovered (red, solid) 2 3 4 log10(# of samples) 5 0.2 top partition recovered (green, dashed) 2 3 4 log10(# of samples) 5 0 1 0 -6300 6 8 10 n 12 14 16 1 2 3 log10(# samples) 4 5 0 1 (a) Success rate for structure (b) Number of samples re- (c) Test set log-likelihood (d) Anchors algorithm sucrecovery vs. sample size (n = quired for structure recovery comparison cess rate (n = 16, unknown k) 16, k = 4) vs. number of items n Figure 3. Simulated data experiments likelihood performance when (a) the true structure learned with known 1 0.8 Success rate 0.6 All leaves recovered Major party leaves recovered is given (but not parameters), (b) a k -thin chain is Top partition recovered Optimized Hierarchy loglikelihood -4500 k, and (c) when we use a random generated 1-chain structure. As expected, knowing the true structure results in the best performance, and the 1-chain is overconstrained. However, our structure learning algorithm is eventually able to catch up to the performance of the true structure given enough samples. It is also interesting to note that the jump in performance at the halfway point in the plot coincides with the jump in the success rate of discovering all leaf sets correctly we conjecture that performance is sometimes less sensitive to the actual hierarchy used, as long as the leaf sets have been correctly discovered. To test the Anchors algorithm, we ran the same simulation using Algorithm 1 on data drawn from hierarchical models with no xed balanced structures, 0.4 0.2 Full tree recovered Optimized 1-chain 2001 0 76 171 389 # samples 882 -5000 50 85 144 243 412 698 1180 2000 # samples (logarithmic scale) (a) Stable features from re- (b) Test log-likelihood vs. covered hierarchies vs. sam- sample size ple size Figure 4. Irish election data experiments methods recovered the same structure in roughly the same time (shown in Figure 5), which is also the structure that (Huang & Guestrin, 2009) obtained by performing a brute-force search over partitions using the KL-divergence objective (Equation 4.1). The leaf sets of the learned hierarchy in fact reect three coalitions in the APA made up of research, clinical and community psychologists respectively. Next, we applied our algorithms to a larger Irish House of Parliament (Dáil Éireann) election dataset from the Meath constituency in Ireland. See Gormley and Murphy (2006) for more election details (including candidate names) as well as an alternative analysis. There were 14 candidates in the 2002 election belonging to the two major rival political parties, Fianna Fáil and Fine Gael, as well as a number of smaller parties. We used a subset of 2500 fully ranked ballots from the election. As with the APA data, both the exhaustive optimization of k. We generated roughly meaning that item sets were recursively partitioned into (almost) equally sized subsets at each level of the hierarchy. From Figure 3(d), we see that the Anchors algorithm can also discover the true structure given enough samples. Interestingly, the dierence in sample complexity for discovering leaf sets versus discovering the full tree is not nearly as pronounced as in Figure 3(a). We believe that this is due to the fact that the balanced trees have less depth than the thin chains, leading to fewer opportunities for our greedy top-down approach to commit errors. Election data. We now demonstrate our methods A number of electoral systems Diaco- with real datasets. around the world require voters to provide a ranking of a set of candidates in order of preference. nis (1989), for example, analyzed a 1980 presidential election of the American Psychological Association (APA) consisting of 5738 rankings of ve candidates (W. Bevan, I. Iscoe, C. Kiesler, M. Siegle, and L. Wright). We applied our methods to learning a hierarchy from the APA data. Since there are only ve Both candidates, we tried both the exhaustive optimization algorithm as well as the anchors algorithm. ^ F and the anchors algorithm returned the same tree, with running times of 69.7 seconds and 2.1 seconds respectively (not including the 3.1 seconds required for precomputing mutual informations). The resulting tree, with candidates enumerated alphabetically from 1 through 14, is shown (only up to depth 4), in Figure 6. As expected, the candidates belonging to the two major parties, Fianna Fáil and Fine Gael, are neatly partitioned into their own leaf sets. The topmost leaf is the Sinn Fein candidate, indicating Learning Hierarchical Rie Independent Groupings from Rankings {WB, II, CK, MS, LW} extra parameters towards modeling the correlations among the two major parties. As our results suggest, such intraparty ranking correlations are crucial for achieving good modeling performance. {WB, CK, MS, LW} {MS, LW} {MS, LW} 1,3 4,5 {II} 2 Community Research Figure 5. Clinical 7. Conclusion We have investigated an intuitive ranking model based on hierarchical ried independence decompositions and have shown that the structure of such hierarchies can be eciently learned from data using our proposed algorithm which can automatically nd rie independent partitions within item sets. Like Bayesian networks, cliques of variables for hierarchical ried independence models can often form intuitive groupings, such as in the election datasets that we analyzed in this paper. However, for problems in which the item groupings are nonobvious, structure learning is of fundamental importance, and we believe that our contributions in this paper open the door to further research into the problem of decomposing huge distributions into tractable pieces much like Bayesian networks have done for other distributions. Sinn Fein Hierarchy learned from APA election data {1,2,3,4,5,6,7,8,9,10,11,12,13,14} {1,2,3,4,5,6,7,8,9,10,11,13,14} {12} {11} Christian Solidarity {1,2,3,4,5,6,7,8,9,10,13,14} {1,4,13} Fianna Fáil {2,3,5,6,7,8,9,10,14} {2,5,6} Fine Gael {3,7,8,9,10,14} {3} Independent {7,8,9,10,14} {14} Labour {7,8,9,10} {7,8,9} Independent {10} Green Figure 6. Hierarchy learned from Irish election data. See (Gormley & Murphy, 2006) for a list of candidates. The candidates are ordered alphabetically in this tree. Acknowledgements This work is supported in part by the ONR under MURI N000140710747, and the Young Investigator Program grant N00014-08-1-0752. We thank K. El-Arini for feedback, and B. Murphy and C. Gormley for datasets. Discussions with M. Meila provided valuable ideas upon which this work is based. that voters tended to insert him into the ranking independently of all of the other 13 candidates. To understand the behavior of our algorithm with smaller sample sizes, we looked for features of the tree from Figure 6 which remained stable even when learning with smaller sample sizes. In Figure 4(a), we subsample from the original training set 100 times at dierent sample sizes and plot the proportion of learned hierarchies which, (a) recover the Sinn Fein candidate as the topmost leaf, (b) partition the two major parties into leaf sets, and (c) agree with the original tree on all leaf sets. Note that even with few training examples, candidates belonging to the major parties are correctly grouped together indicating strong party inuence in voting behavior. Finally, we compared the results between learning a general hierarchy (without xed k) and learning Figure 4(b) a 1-thin chain model on the Irish data. shows the log-likelihoods achieved by both models on a held-out test set as the training set size increases. For each training set size, we subsampled the Irish dataset 100 times to produce condence intervals. Again, even with small sample sizes, the hierarchy outperforms the 1-chain and continually improves with more and more training data. One might think that the hierarchical models, which use more parameters are prone to overtting, but in practice, the models learned by our algorithm devote most of the Chechetka, A. and Guestrin, C. Ecient principled learning of thin junction trees. In Advances in Neural Information Processing Systems 21, 2007. Diaconis, P. A generalization of spectral analysis with application to ranked data. The Annals of Statistics, 17 (3):949979, 1989. Gallo, G., Longo, G., Pallottino, S., and Nguyen, S. Directed hypergraphs and applications. Discrete Applied Math., 42(2-3), 1993. Gormley, C. and Murphy, B. A latent space model for rank data. In 23rd International Conference on Machine Learning, 2006. Högen, K. U. Learning and robust learning of product distributions. In Sixth Annual Conference on Learning Theory, 1993. Huang, J. and Guestrin, C. Ried independence for ranked data. In Advances in Neural Information Processing Systems 23, 2009. Huang, J., Guestrin, C., Jiang, X., and Guibas, L. Exploiting probabilistic independence for permutations. In Articial Intelligence and Statistics, JMLR W&CP 5, 2009. Motwani, R. and Raghavan, P. Randomized algorithms. ACM Computational Survey, 28(1), 1996. Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions in Pattern Analysis and Machine Intelligence, 22(8), 2000. References