Mixed Membership Stochastic Blockmodels

Edoardo M. Airoldi^{1,2}, David M. Blei^{1}, Stephen E. Fienberg^{3,4} & Eric P. Xing^{4}
^{1} Department of Computer Science, ^{2} Lewis-Sigler Institute, Princeton University
^{3} Department of Statistics, ^{4} School of Computer Science, Carnegie Mellon University
eairoldi@Princeton.EDU

A longer version of this work is available online, at http://arxiv.org/abs/0705.4485/

Abstract

In many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks, the data consist of pairwise measurements, e.g., the presence or absence of links between pairs of objects. Analyzing such data with probabilistic models requires non-standard assumptions, since the usual independence or exchangeability assumptions no longer hold. In this paper, we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) with a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.

1 Introduction

The problem of modeling relational information among objects, such as pairwise relations represented as graphs, arises in a number of settings in machine learning. For example, the scientific literature connects papers by citation, the Web connects pages by links, and protein-protein interaction data connect proteins by physical interaction records. In these settings, we often wish to infer hidden attributes of the objects from the observed measurements on pairwise properties. For example, we might want to compute a clustering of the web pages, predict the functions of a protein, or assess the degree of relevance of a scientific abstract to a scholar's query. Unlike traditional attribute data collected over individual objects, relational data violate the classical independence and exchangeability assumptions made in machine learning and statistics. The objects are dependent by their very nature, and this interdependence suggests that a different set of assumptions is more appropriate. Recently proposed models aim at resolving relational information into a collection of connectivity patterns. Such models are based on assumptions that often ignore useful technical necessities or important empirical regularities. For instance, exponential random graph models [11] summarize the variability in a collection of paired measurements with a set of relational motifs, but do not provide unit-specific representations useful for making predictions. Latent space models [4] project individual units of analysis into a low-dimensional latent space, but do not provide a group structure in that space useful for clustering. Stochastic blockmodels [8, 6] resolve paired measurements into groups and connectivity between pairs of groups, but constrain each unit to instantiate the connectivity pattern of a single group, contrary to what is observed in most applications. Mixed membership models, such as latent Dirichlet allocation [1], have emerged in recent years as a flexible modeling tool for data where the single-group assumption is violated by the heterogeneity within a unit of analysis, e.g., a document or a node in a graph. They have been successfully applied in many domains, such as document analysis [1], image processing [7], and population genetics [9].
Mixed membership models associate each unit of analysis with multiple groups rather than a single group, via a vector of membership probabilities. The concurrent membership of a unit in different groups can capture its different aspects, such as the different underlying topics of the words constituting each document. The mixed membership formalism is a particularly natural idea for relational data, where objects can bear multiple latent roles or cluster memberships that influence their relationships to others. Existing mixed membership models, however, are not appropriate for relational data, because they assume that the data are conditionally independent given their latent membership vectors. In relational settings, an object is described by its relationships to others; thus it is more appropriate to assume that the ensemble of mixed membership vectors jointly governs the relationships of each object. Here we develop mixed membership models for relational data and describe a fast variational algorithm for inference and estimation. Our model captures the multiple roles that objects exhibit in interaction with others, and the relationships between those roles in determining the observed interaction matrix. We apply our model to protein interaction and social networks.

2 The Basic Mixed Membership Blockmodel

Observations consist of pairwise measurements, represented as a graph $G = (N, Y)$, where $Y(p, q)$ denotes the measurement taken on the pair of nodes $(p, q)$. In this section we consider observations consisting of a single binary matrix, where $Y(p, q) \in \{0, 1\}$, i.e., the data can be represented with a directed graph. The model generalizes to two important settings, however, as we discuss below: a collection of matrices and/or other types of measurements. We summarize a collection of pairwise measurements with a mapping from nodes to sets of nodes, called blocks, and with pairwise relations among the blocks themselves. Intuitively, the inference process aims at identifying nodes that are similar to one another in terms of their connectivity to blocks of nodes. Similar nodes are mapped to the same block. Individual nodes are allowed to instantiate the connectivity patterns of multiple blocks. Thus, the goal of the analysis with a Mixed Membership Blockmodel (MMB) is to identify (i) the mixed membership mapping of the nodes, i.e., the units of analysis, to a fixed number of blocks, $K$, and (ii) the pairwise relations among the blocks. Pairwise measurements among $N$ nodes are then generated according to latent distributions of block membership for each node and a matrix of block-to-block interaction strengths. The latent per-node distributions are specified by simplicial vectors: each node is associated with a randomly drawn vector, say $\pi_i$ for node $i$, where $\pi_{i,g}$ denotes the probability of node $i$ belonging to group $g$. In this fractional sense, each node can belong to multiple groups with different degrees of affiliation. The probabilities of interactions between different groups are defined by a matrix of Bernoulli rates $B_{(K \times K)}$, where $B(g, h)$ represents the probability of having a connection from a node in group $g$ to a node in group $h$.
The indicator vector $z_{p\to q}$ denotes the specific block membership of node $p$ when it connects to node $q$, while $z_{p\gets q}$ denotes the specific block membership of node $q$ when it is connected from node $p$. The complete generative process for a graph $G = (N, Y)$ is as follows:

· For each node $p \in N$:
  – Draw a $K$-dimensional mixed membership vector $\pi_p \sim \mathrm{Dirichlet}(\alpha)$.
· For each pair of nodes $(p, q) \in N \times N$:
  – Draw the membership indicator for the initiator, $z_{p\to q} \sim \mathrm{Multinomial}(\pi_p)$.
  – Draw the membership indicator for the receiver, $z_{p\gets q} \sim \mathrm{Multinomial}(\pi_q)$.
  – Sample the value of their interaction, $Y(p, q) \sim \mathrm{Bernoulli}(z_{p\to q}^\top B\, z_{p\gets q})$.

Note that the group membership of each node is context dependent, i.e., each node may assume different memberships when interacting with different peers. Statistically, each node is an admixture of group-specific interactions. The two sets of latent group indicators are denoted by $\{z_{p\to q} : p, q \in N\} =: Z_\to$ and $\{z_{p\gets q} : p, q \in N\} =: Z_\gets$. Further, the pair of group memberships that underlies an interaction, e.g., $(z_{p\to q}, z_{p\gets q})$ for $Y(p, q)$, need not be equal; this fact is useful for characterizing asymmetric interaction networks. Equality may be enforced when modeling symmetric interactions. The joint probability of the data $Y$ and the latent variables $\{\pi_{1:N}, Z_\to, Z_\gets\}$ sampled according to the MMB is:
\[
p(Y, \pi_{1:N}, Z_\to, Z_\gets \mid \alpha, B) = \prod_{p,q} P\bigl(Y(p,q) \mid z_{p\to q}, z_{p\gets q}, B\bigr)\, P(z_{p\to q} \mid \pi_p)\, P(z_{p\gets q} \mid \pi_q) \prod_p P(\pi_p \mid \alpha).
\]

Figure 1: A graphical model of the mixed membership blockmodel (MMB). We did not draw all the arrows out of the blockmodel $B$ for clarity; all the pairwise measurements, $Y(p, q)$, depend on it.

Introducing Sparsity. Adjacency matrices encoding binary pairwise measurements often contain a large number of zeros, or non-interactions; they are sparse. It is useful to distinguish two sources of non-interaction: non-interactions may be the result of the rarity of interactions in general, or they may be an indication that the pair of relevant blocks rarely interact. In applications to the social sciences, for instance, nodes may represent people and blocks may represent social communities. In this setting, it is reasonable to expect that a large portion of the non-interactions is due to limited opportunities for contact between people in a large population, or to the design of the questionnaire, rather than to deliberate choices, the structure of which the blockmodel is trying to estimate. It is useful to account for these two sources of sparsity at the model level. A good estimate of the portion of zeros that should not be explained by the blockmodel $B$ reduces the bias in the estimates of its elements. A sparsity parameter $\rho \in [0, 1]$ can be introduced in the model above to characterize the source of non-interaction. Instead of sampling a relation $Y(p, q)$ directly from the Bernoulli with parameter specified as above, we down-weight the probability of a successful interaction to $(1 - \rho) \cdot z_{p\to q}^\top B\, z_{p\gets q}$. This is the result of assuming that the probability of a non-interaction comes from a mixture,
\[
1 - \sigma_{pq} = (1 - \rho) \cdot z_{p\to q}^\top (1 - B)\, z_{p\gets q} + \rho,
\]
where the weight $\rho$ captures the portion of zeros that should not be explained by the blockmodel $B$.
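To make the generative process concrete, the following is a minimal sampler written as a sketch in Python/NumPy; the function name `sample_mmb`, the toy parameter values, and the choice to skip self-pairs are our own illustrative assumptions, not prescriptions from the model:

```python
import numpy as np

def sample_mmb(N, K, alpha, B, rho=0.0, seed=0):
    """Sample a directed graph from the MMB generative process above.

    alpha : (K,) Dirichlet hyper-parameter
    B     : (K, K) matrix of block-to-block Bernoulli rates
    rho   : sparsity parameter; down-weights the interaction probability
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(alpha, size=N)          # mixed membership vectors pi_p
    Y = np.zeros((N, N), dtype=int)
    for p in range(N):
        for q in range(N):
            if p == q:
                continue                       # skip self-pairs (an illustrative choice)
            g = rng.choice(K, p=pi[p])         # z_{p->q}: initiator's block
            h = rng.choice(K, p=pi[q])         # z_{p<-q}: receiver's block
            Y[p, q] = rng.binomial(1, (1.0 - rho) * B[g, h])
    return Y, pi

# Toy example: two assortative blocks with rare cross-block interactions.
B = np.array([[0.90, 0.05],
              [0.05, 0.80]])
Y, pi = sample_mmb(N=20, K=2, alpha=np.full(2, 0.25), B=B, rho=0.1)
```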
A large value of $\rho$ will cause the interactions in the matrix to be weighted more than the non-interactions in determining plausible values for $\{\alpha, B, \pi_{1:N}\}$. Recall that $\{\alpha, B\}$ are constant quantities to be estimated, while $\{\pi_{1:N}, Z_\to, Z_\gets\}$ are unknown variable quantities whose posterior distribution needs to be determined. Below, we detail the variational expectation-maximization (EM) procedure to carry out approximate estimation and inference.

2.1 Variational E-Step

During the E-step, we update the posterior distribution over the unknown variable quantities $\{\pi_{1:N}, Z_\to, Z_\gets\}$. The normalizing constant of the posterior is the marginal probability of the data, which requires an intractable integral over the simplicial vectors $\pi_p$,
\[
p(Y \mid \alpha, B) = \int_{\pi_{1:N}} \sum_{z_{p\to q},\, z_{p\gets q}} p(Y, \pi_{1:N}, Z_\to, Z_\gets \mid \alpha, B). \tag{1}
\]
We appeal to mean-field variational methods [5] to approximate the posterior of interest. The main idea behind variational methods is to posit a simple distribution over the latent variables with free parameters, which are fit to make the approximation close in Kullback-Leibler divergence to the true posterior of interest. The log of the marginal probability in Equation 1 can be bounded as follows,
\[
\log p(Y \mid \alpha, B) \geq \mathbb{E}_q \bigl[ \log p(Y, \pi_{1:N}, Z_\to, Z_\gets \mid \alpha, B) \bigr] - \mathbb{E}_q \bigl[ \log q(\pi_{1:N}, Z_\to, Z_\gets) \bigr], \tag{2}
\]
by introducing a distribution over the latent variables, $q$, that depends on a set of free parameters. We specify $q$ as the mean-field fully-factorized family, $q(\pi_{1:N}, Z_\to, Z_\gets \mid \gamma_{1:N}, \Phi_\to, \Phi_\gets)$, where $\{\gamma_{1:N}, \Phi_\to, \Phi_\gets\}$ are the free variational parameters that can be set to tighten the bound. We then tighten the bound with respect to the variational parameters, to minimize the KL divergence between $q$ and the true posterior. The updates for the variational multinomial parameters are
\[
\hat\phi_{p\to q, g} \propto e^{\mathbb{E}_q[\log \pi_{p,g}]} \cdot \prod_h \Bigl( B(g,h)^{Y(p,q)} \cdot \bigl(1 - B(g,h)\bigr)^{1 - Y(p,q)} \Bigr)^{\hat\phi_{p\gets q, h}}, \tag{3}
\]
\[
\hat\phi_{p\gets q, h} \propto e^{\mathbb{E}_q[\log \pi_{q,h}]} \cdot \prod_g \Bigl( B(g,h)^{Y(p,q)} \cdot \bigl(1 - B(g,h)\bigr)^{1 - Y(p,q)} \Bigr)^{\hat\phi_{p\to q, g}}, \tag{4}
\]
for $g, h = 1, \ldots, K$, where $\mathbb{E}_q[\log \pi_{p,g}] = \psi(\gamma_{p,g}) - \psi(\sum_k \gamma_{p,k})$ is the usual Dirichlet expectation. The update for the variational Dirichlet parameters $\gamma_{p,k}$ is
\[
\hat\gamma_{p,k} = \alpha_k + \sum_q \hat\phi_{p\to q, k} + \sum_q \hat\phi_{q\gets p, k}, \tag{5}
\]
for all nodes $p = 1, \ldots, N$ and $k = 1, \ldots, K$.

Nested Variational Inference. To improve convergence, we employed a nested variational inference scheme based on an alternative schedule of updates to the traditional ordering. In a naïve iteration scheme for variational inference, one initializes the variational Dirichlet parameters $\gamma_{1:N}$ and the variational multinomial parameters $(\phi_{p\to q}, \phi_{p\gets q})$ to non-informative values, and then iterates the following two steps until convergence: (i) update $\phi_{p\to q}$ and $\phi_{p\gets q}$ for all edges $(p, q)$, and (ii) update $\gamma_p$ for all nodes $p \in N$. At each variational inference cycle we allocate $NK + 2N^2K$ scalars. In our experiments, the naïve variational algorithm often failed to converge, or converged only after many iterations. We attribute this behavior to the dependence between $\gamma_{1:N}$ and $B$, which is not accounted for by the naïve algorithm. The nested variational inference algorithm retains a portion of this dependence across iterations by choosing a particular path to convergence: we simply keep the block of free parameters $(\phi_{p\to q}, \phi_{p\gets q})$ at an optimal solution given the other variational parameters. These parameters are involved in the updates of the parameters in $\gamma_{1:N}$ and in $B$, thus effectively providing a channel that maintains some dependence among them. From a computational perspective, the nested algorithm trades time for space, thus allowing us to deal with large graphs: at each variational cycle we allocate only $NK + 2K$ scalars.
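As a concrete illustration of the nested schedule, here is a sketch of Equations 3-5 in Python/NumPy; the function names and the fixed number of inner fixed-point iterations are our own choices, and a practical implementation would monitor convergence and re-estimate $B$ between passes:

```python
import numpy as np
from scipy.special import digamma

def update_phi_pair(Ypq, gamma_p, gamma_q, B, n_inner=10):
    """Inner fixed point for one pair (p, q): Equations 3 and 4.
    Assumes 0 < B < 1 entrywise. Returns phi_send (for z_{p->q}) and
    phi_recv (for z_{p<-q}), each a distribution over the K blocks."""
    Elog_p = digamma(gamma_p) - digamma(gamma_p.sum())   # E_q[log pi_p]
    Elog_q = digamma(gamma_q) - digamma(gamma_q.sum())   # E_q[log pi_q]
    logB = np.log(B) if Ypq == 1 else np.log1p(-B)       # (K, K) likelihood terms
    phi_recv = np.full(B.shape[0], 1.0 / B.shape[0])
    for _ in range(n_inner):
        s = Elog_p + logB @ phi_recv                     # Equation 3, in log space
        phi_send = np.exp(s - s.max()); phi_send /= phi_send.sum()
        r = Elog_q + logB.T @ phi_send                   # Equation 4, in log space
        phi_recv = np.exp(r - r.max()); phi_recv /= phi_recv.sum()
    return phi_send, phi_recv

def nested_e_step(Y, gamma, alpha, B):
    """One pass of the nested E-step: each pair's phi's are kept at their
    conditional optimum and folded into gamma on the fly (Equation 5),
    so only N*K + 2*K scalars are live at any time."""
    N, K = gamma.shape
    new_gamma = np.tile(np.asarray(alpha, dtype=float), (N, 1))
    for p in range(N):
        for q in range(N):
            if p == q:
                continue
            phi_s, phi_r = update_phi_pair(Y[p, q], gamma[p], gamma[q], B)
            new_gamma[p] += phi_s                        # p acted as initiator
            new_gamma[q] += phi_r                        # q acted as receiver
    return new_gamma
```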
The algorithm can be parallelized and, empirically, it leads to a better local optimum of the likelihood bound per unit of running time.

2.2 M-Step

During the M-step, we maximize the lower bound in Equation 2, used as a surrogate for the likelihood, with respect to the unknown constants $\{\alpha, B\}$. In other words, we compute the empirical Bayes estimates of the hyper-parameters. The M-step is equivalent to finding the MLE using expected sufficient statistics under the variational distribution. We consider the maximization step for each parameter in turn. A closed-form solution for the approximate maximum likelihood estimate of $\alpha$ does not exist. We used a linear-time Newton-Raphson method, with gradient and Hessian
\[
\frac{\partial \mathcal{L}}{\partial \alpha_k} = N \Bigl( \psi\bigl(\textstyle\sum_k \alpha_k\bigr) - \psi(\alpha_k) \Bigr) + \sum_p \Bigl( \psi(\gamma_{p,k}) - \psi\bigl(\textstyle\sum_k \gamma_{p,k}\bigr) \Bigr)
\quad \text{and} \quad
\frac{\partial^2 \mathcal{L}}{\partial \alpha_{k_1} \partial \alpha_{k_2}} = N \Bigl( \mathbb{1}_{(k_1 = k_2)} \cdot \psi'(\alpha_{k_1}) - \psi'\bigl(\textstyle\sum_k \alpha_k\bigr) \Bigr),
\]
respectively. The approximate MLE of $B$ is
\[
\hat B(g,h) = \frac{ \sum_{p,q} Y(p,q) \cdot \hat\phi_{p\to q, g}\, \hat\phi_{p\gets q, h} }{ \sum_{p,q} \hat\phi_{p\to q, g}\, \hat\phi_{p\gets q, h} }, \tag{6}
\]
for every pair $(g, h) \in [1, K]^2$. Finally, the approximate MLE of the sparsity parameter $\rho$ is
\[
\hat\rho = \frac{ \sum_{p,q} \bigl(1 - Y(p,q)\bigr) \cdot \sum_{g,h} \hat\phi_{p\to q, g}\, \hat\phi_{p\gets q, h} }{ \sum_{p,q} \sum_{g,h} \hat\phi_{p\to q, g}\, \hat\phi_{p\gets q, h} }. \tag{7}
\]
Alternatively, we can fix $\rho$ prior to the analysis; the density of the interaction matrix is estimated with $\hat d = \sum_{p,q} Y(p,q) / N^2$, and the sparsity parameter is set to $\tilde\rho = (1 - \hat d)$. This latter estimator attributes all the information in the non-interactions to the point mass, i.e., to latent sources other than the blockmodel $B$ or the mixed membership vectors $\pi_{1:N}$. It does, however, provide a quick recipe to reduce the computational burden during exploratory analyses. Several model selection strategies exist for hierarchical models. In our setting, model selection translates into the choice of the number of blocks, $K$. Below, we choose $K$ with held-out likelihood in a cross-validation experiment on large networks, and with an approximate BIC on small networks.

2.3 Summarizing and De-Noising Pairwise Measurements

It is useful to consider two perspectives when analyzing data with MMB: (i) we summarize the data, $Y$, in terms of the global blockmodel, $B$, and the node-specific mixed memberships, the $\pi$'s; (ii) we de-noise the data, $Y$, in terms of the global blockmodel, $B$, and the interaction-specific single memberships, the $Z$'s. In both cases the model depends on a small set of unknown constants to be estimated: $\alpha$, $\rho$, and $B$. The likelihood is the same in both cases, although the rationale for including the set of latent variables $Z$'s differs. When summarizing data, we could integrate out the $Z$'s analytically; this leads to numerical optimization of a smaller set of variational parameters, the $\gamma$'s. We choose to keep the $Z$'s to simplify inference. When de-noising, the $Z$'s are instrumental in estimating posterior expectations of each interaction individually--a network analog of the Kalman filter. The posterior expectation of an interaction is computed as $\hat\pi_p^\top \hat B\, \hat\pi_q$ and as $\hat\phi_{p\to q}^\top \hat B\, \hat\phi_{p\gets q}$ in the two cases, respectively.

3 Empirical Results

We evaluated the MMB on simulated data and on three collections of pairwise measurements. Results on simulated data show that variational EM accurately recovers the mixed membership map, $\pi_{1:N}$, and the blockmodel, $B$, when data are sampled according to the model. Cross-validation suggests an accurate estimate for $K$. Nested variational scheduling of parameter updates makes inference parallelizable and typically reaches a better solution than the naïve scheduling.
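Both uses of the model described in Section 2.3 enter the experiments below. Computationally, they reduce to a pair of short reconstructions; a minimal sketch, assuming fitted variational parameters are available (the array names are our own):

```python
import numpy as np

def summarize(gamma, B_hat):
    """Summarization: E[Y(p,q)] ~ pi_p^T B pi_q, using posterior means of pi."""
    pi_hat = gamma / gamma.sum(axis=1, keepdims=True)
    return pi_hat @ B_hat @ pi_hat.T             # (N, N) coarse reconstruction

def denoise(phi_send, phi_recv, B_hat):
    """De-noising: E[Y(p,q)] ~ phi_{p->q}^T B phi_{p<-q}, per interaction.

    phi_send, phi_recv : (N, N, K) variational indicator distributions.
    """
    return np.einsum('pqg,gh,pqh->pq', phi_send, B_hat, phi_recv)
```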
First, we consider whom-do-you-like relations among 18 novices in a New England monastery. The unsupervised analysis demonstrates the type of patterns that MMB recovers from data, and allows us to contrast summaries of the original measurements achieved through prediction and through de-noising. The data were collected by Sampson during his stay at the monastery, while the novices were preparing to join a monastic order [10]. Sampson's original analysis was rooted in direct anthropological observations. He strongly suggested the existence of tight factions among the novices: the loyal opposition (whose members joined the monastery first), the young turks (who joined later on), the outcasts (who were not accepted into the two main factions), and the waverers (who did not take sides). The events that took place during Sampson's stay at the monastery supported his observations: members of the young turks resigned or were expelled over religious differences (John and Gregory). Scholars in the social sciences typically regard the labels assigned by Sampson to the novices (and his conclusions more generally) as ground truth when assessing the quality of quantitative analyses; we shall do the same. Using the variational EM algorithm above, we fit an array of mixed membership blockmodels with different values of $K$, and collected model estimates $\{\hat\alpha, \hat B\}$ and posterior mixed membership vectors $\hat\pi_{1:18}$ for the novices. We used an approximation of BIC to choose the value of $K$ supported by the data. This criterion selects $\hat K = 3$, the same number of proper groups that Sampson identified based on anthropological observations--the waverers are interstitial members, rather than a group. Figure 2 shows the patterns that the mixed membership blockmodel with $\hat K = 3$ recovers from the data. In particular, the top-left panel shows a graphical representation of the blockmodel $\hat B$. The block that we can identify a posteriori with the loyal opposition is portrayed as central to the monastery, while the block identified with the outcasts shows the lowest internal coherence, in accordance with Sampson's observations. The top-right panel illustrates the posterior means of the mixed membership scores, $\mathbb{E}[\pi \mid Y]$, for the 18 monks in the monastery. The monks cluster according to Sampson's classification, with the Young Turks, Loyal Opposition, and Outcasts each dominating a corner. We can see the central role played by John Bosco and Gregory, who exhibit relations in all three groups, as well as the uncertain affiliations of Ramuald and Victor; Amand's uncertain affiliation, however, is not captured.

Figure 2: Top-Left: Estimated blockmodel, $\hat B$. Top-Right: Posterior mixed membership vectors, $\hat\pi_{1:18}$, projected in the simplex. The estimates correspond to a model with the $\hat B$ shown top-left and $\hat\rho = 0.058$. Numbered points can be mapped to monks' names using the legend on the right. The colors identify the four factions defined by Sampson's anthropological observations. Bottom: Original adjacency matrix of whom-do-you-like sociometric relations (left), relations predicted using approximate MLEs for $\pi_{1:N}$ and $B$ (center), and relations de-noised using the model including the $Z$ indicators (right).

The bottom panels contrast the different resolutions of the original adjacency matrix of whom-do-you-like sociometric relations (left panel) obtained by different uses of MMB. If the goal of the analysis is to find a parsimonious summary of the data, the amount of relational information that is captured by $\hat\rho$, $\hat B$, and $\mathbb{E}[\pi \mid Y]$ leads to a coarse reconstruction of the original sociomatrix (central panel).
If the goal of the analysis is to de-noise a collection of pairwise measurements, the amount of relational information that is revealed by $\hat\rho$, $\hat B$, $\mathbb{E}[\pi \mid Y]$, and $\mathbb{E}[Z_\to, Z_\gets \mid Y]$ leads to a finer reconstruction of the original sociomatrix, $Y$: relations in $Y$ are re-weighted according to how much sense they make to the model (right panel). In conclusion, the unsupervised analysis of the sociometric relations with MMB offers quantitative support to several of Sampson's anthropological observations.

Second, we consider a friendship network among a group of 69 students in grades 7-12. The analysis here directly compares clustering results obtained by MMB to published clustering results obtained by competing models, in a setting where a fair amount of social segregation is expected [2, 3]. The data are a collection of friendship relations among 69 students in one of the 80 schools surveyed in the National Study of Adolescent Health. The original population in the school of interest consisted of 71 students; two students expressed no friendship preferences and were excluded from the analysis. We used the variational EM algorithm to fit an array of mixed membership blockmodels with different values of $K$, collected model estimates, and used an approximation to BIC to select $K$. This procedure identifies $\hat K = 6$ as the model size that best explains the data; note that six is the number of grade groups in the student population. The blocks are clearly interpretable a posteriori in terms of grades, thus providing a mapping between grades and blocks. Conditionally on such a mapping, we assign students to the grade they are most associated with, according to their posterior-mean mixed membership vectors, $\mathbb{E}[\pi_n \mid Y]$. To be fair in the comparison with competing models, we assign each student to a unique grade, even though MMB allows for mixed membership. Table 1 reports the correspondence of grades to blocks by quoting the number of students in each grade-block pair, for MMB versus the mixture blockmodel (MB) in [2] and the latent space cluster model (LSCM) in [3]. The higher the sum of counts on the diagonal, the better the correspondence; the higher the sum of counts off the diagonal, the worse the correspondence. MMB performs best by allocating 63 students to their grades, versus 57 for MB and 37 for LSCM. Correspondence only partially captures goodness of fit; however, it is a good metric in the setting we consider, where a fair amount of clustering is present. Moreover, the extra flexibility MMB offers over MB and LSCM reduces bias in the prediction of the membership of students to blocks. In other words, mixed membership does not absorb noise in this example; rather, it accommodates variability in the friendship relation that is instrumental in producing better predictions.

Table 1: Grade levels versus (highest) expected posterior membership for the 69 students, according to three alternative models: MMB is the proposed mixed membership blockmodel, MB is the mixture blockmodel in [2], and LSCM is the latent space cluster model in [3].

         MMB Clusters              MB Clusters               LSCM Clusters
Grade  1   2   3   4   5   6    1   2   3   4   5   6    1   2   3   4   5   6
7      13  1   0   0   0   0    13  1   0   0   0   0    13  1   0   0   0   0
8      0   9   2   0   0   1    0   10  2   0   0   0    0   11  1   0   0   0
9      0   0   16  0   0   0    0   0   10  0   0   6    0   0   7   6   3   0
10     0   0   0   10  0   0    0   0   0   10  0   0    0   0   0   0   3   7
11     0   0   1   0   11  1    0   0   1   0   11  1    0   0   0   0   3   10
12     0   0   0   0   0   4    0   0   0   0   0   4    0   0   0   0   0   4
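The comparison underlying Table 1 amounts to a hard assignment followed by a cross-tabulation; a minimal sketch of this bookkeeping, with hypothetical variable names of our own choosing:

```python
import numpy as np

def correspondence_table(gamma, grades, n_blocks):
    """Cross-tabulate each student's grade against the block with the
    highest posterior-mean membership (diagonal-heavy = good correspondence)."""
    pi_hat = gamma / gamma.sum(axis=1, keepdims=True)   # E[pi_n | Y]
    hard_block = pi_hat.argmax(axis=1)                  # one block per student
    levels = sorted(set(grades))
    table = np.zeros((len(levels), n_blocks), dtype=int)
    for grade, block in zip(grades, hard_block):
        table[levels.index(grade), block] += 1
    return table
```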
Third, we consider physical interactions among 871 proteins in yeast. The analysis allows us to evaluate quantitatively the utility of MMB in summarizing and de-noising complex connectivity patterns, using an independent set of functional annotations: given two models that suggest different sets of interactions as reliable, we prefer the model that reveals functionally relevant interactions. The pairwise measurements consist of a hand-curated collection of physical protein interactions made available by the Munich Information Center for Protein Sequences (MIPS). The yeast genome database provides independent functional annotations for each protein, which we use to evaluate the functional content of the protein networks derived with MMB from the MIPS data, as detailed below. We explored a large model space, $K = 2, \ldots, 225$, and used five-fold cross-validation to identify a blockmodel $B$ that would reduce the dimensionality of the physical interactions among proteins in the training set, while revealing robust aspects of connectivity that could be leveraged to predict physical interactions among proteins in the test set. We determined that a fairly parsimonious model, $K = 50$, provides a good description of the observed physical interaction network. This finding supports the hypothesis that the blocks derived from the MIPS data are interpretable in terms of functional biological contexts. Alternatively, the blocks might encode signal at a finer resolution, such as that of protein complexes. If that were the case, however, we would expect the optimal number of blocks to be significantly higher: $871/5 \approx 175$, given an average size of five proteins per protein complex. At this stage, we evaluated the functional content of the posterior induced by MMB. The goal is to assess to what extent MMB reveals substantive information about the functionality of proteins that can be used to inform subsequent analyses. To do this, we first fit a model on the whole data set to estimate the blockmodel, $B_{(50 \times 50)}$, and the mixed membership vectors between proteins and blocks, $\pi_{1:871}$; we then either impute physical interactions by thresholding the posterior expectations computed using the blockmodel and the mixed membership map (prediction task), or de-noise the observed interactions using the blockmodel and the mixed membership map (de-noising task). Posterior expectations of each interaction are in $[0, 1]$. Thresholding such expectations at a level $\tau$, for instance, leads to a collection of binary physical interactions that are deemed reliable with probability at least $\tau$. We use the independent set of functional annotations from the yeast database to decide which interactions are functionally meaningful, namely those between pairs of proteins that share at least one functional annotation. In this sense, between two models that suggest different sets of interactions as reliable, our evaluation assigns a higher score to the model that reveals functionally relevant interactions.

Figure 3: Functional content of the MIPS collection of protein interactions (yellow diamond) on a precision-recall plot, compared against other published collections of interactions and microarray data, and against the posterior estimates of the MMB models--computed as described in the text. The two MMB curves correspond to $K = 50$: MIPS de-noised with the $Z$'s and $B$, and MIPS summarized with the $\pi$'s and $B$; recall is unnormalized.
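The evaluation criterion just described can be made concrete with a short sketch; the array names are hypothetical, and only the criterion itself (threshold, then score retained interactions by shared annotations) comes from the text:

```python
import numpy as np

def functional_content(expectations, shares_annotation, thresholds):
    """Score thresholded interaction estimates by functional annotation overlap.

    expectations      : (N, N) posterior interaction expectations in [0, 1]
    shares_annotation : (N, N) boolean; True if the two proteins share at
                        least one functional annotation
    Returns (threshold, precision, functional count) triples; the raw count
    plays the role of unnormalized recall, as in Figure 3.
    """
    curve = []
    for tau in thresholds:
        kept = expectations >= tau              # impute interactions at level tau
        n_kept = int(kept.sum())
        n_functional = int((kept & shares_annotation).sum())
        precision = n_functional / max(n_kept, 1)
        curve.append((tau, precision, n_functional))
    return curve
```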
Figure 3 shows the functional content of the original MIPS collection of physical interactions (point no. 2), and of the collections of interactions computed using $(\hat B, \hat\pi\text{'s})$, the light blue (-×) line, and using $(\hat B, \hat Z\text{'s})$, the dark blue (-+) line, thresholded at ten different levels to yield precision-recall curves. The posterior means of the $\pi$'s provide a parsimonious representation for the MIPS collection, and lead to precise interaction estimates, in moderate amount (-× line). The posterior means of the $Z$'s provide a richer representation for the data, and describe most of the functional content of the MIPS collection with high precision (-+ line). Most importantly, notice that the estimated protein interaction networks, i.e., the ×'s and +'s corresponding to lower levels of recall, feature a more precise functional content than the original collection. This means that the proposed latent block structure is helpful in summarizing the collection of interactions, by ranking them properly. On closer inspection, dense blocks of predicted interactions contain known functional predictions that were not in the MIPS collection, thus effectively improving the quality of the data; these predictions instantiate activity specific to a few biological contexts, such as biopolymer catabolism and homeostasis. In conclusion, the results suggest that MMB successfully reduces the dimensionality of the data, while revealing substantive information about the multiple functionalities of proteins that can be used to inform subsequent analyses.

Remarks. A. In the relational setting, cross-validation is feasible if the blockmodel estimated on training data can be expected to hold on test data; for this to happen the network must be of reasonable size, so that we can expect members of each block to appear in both the training and test sets. In this setting, the scheduling of variational updates is important; nested variational scheduling leads to efficient and parallelizable inference. B. MMB includes two sources of variability, $B$ and the $\pi$'s, that are apparently in competition for explaining the data, possibly raising an identifiability issue. This is not the case, however, as the blockmodel $B$ captures global/asymmetric relations, while the mixed membership vectors $\pi$'s capture local/symmetric relations. This difference practically eliminates the issue, unless there is no signal in the data to begin with. C. MMB generalizes to two important cases. First, multiple data collections $Y_{1:M}$ on the same objects can be generated by the same latent vectors. This might be useful, for instance, for analyzing multivariate sociometric relations simultaneously. Second, while in the MMB above the data-generating distribution is a Bernoulli, $B$ can be a matrix of parameters for any kind of distribution. For instance, technologies for measuring interactions between pairs of proteins, such as mass spectrometry and tandem affinity purification, return a probabilistic assessment about the presence of interactions, thus setting the range of $Y(p,q)$ to $[0, 1]$.

References

[1] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[2] P. Doreian, V. Batagelj, and A. Ferligoj. Discussion of "Model-based clustering for social networks". Journal of the Royal Statistical Society, Series A, 170, 2007.
[3] M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A, 170:1-22, 2007.
[4] P. D. Hoff, A. E. Raftery, and M. S. Handcock.
Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090-1098, 2002.
[5] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.
[6] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.
[7] F.-F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[8] K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96:1077-1087, 2001.
[9] J. K. Pritchard, M. Stephens, N. A. Rosenberg, and P. Donnelly. Association mapping in structured populations. American Journal of Human Genetics, 67:170-181, 2000.
[10] F. S. Sampson. A Novitiate in a Period of Change: An Experimental and Case Study of Social Relationships. PhD thesis, Cornell University, 1968.
[11] S. Wasserman, G. Robins, and D. Steinley. A brief review of some recent research. In Statistical Network Analysis: Models, Issues and New Directions, Lecture Notes in Computer Science. Springer-Verlag, 2007.