Pacific Symposium on Biocomputing 12:257-268 (2007)

A STACKED GRAPHICAL MODEL FOR ASSOCIATING SUB-IMAGES WITH SUB-CAPTIONS

ZHENZHEN KOU, WILLIAM W. COHEN, AND ROBERT F. MURPHY
Machine Learning Department, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
E-mail: zkou@andrew.cmu.edu, wcohen@cs.cmu.edu, murphy@cmu.edu

There is extensive interest in mining data from full text. We have built a system called SLIF (for Subcellular Location Image Finder), which extracts information on one particular aspect of biology from a combination of text and images in journal articles. Associating the information from the text and image requires matching sub-figures with the sentences in the text. We introduce a stacked graphical model, a meta-learning scheme that augments a base learner by expanding features based on related instances, to match the labels of sub-figures with labels of sentences. The experimental results show a significant improvement in the matching accuracy of the stacked graphical model (81.3%) as compared with a relational dependency network (70.8%) or the current algorithm in SLIF (64.3%).

1. Introduction

The vast size of the biological literature and the knowledge contained therein makes it essential to organize and summarize pertinent scientific results. Biological literature mining has been increasingly studied to extract information from huge numbers of biological articles [1-3]. Most of the existing IE systems are limited to extracting information only from text. Recently there has been great interest in mining from both text and images. Yu and Lee [4] designed BioEx, which analyses abstract sentences to retrieve the images in an article. Rafkind et al. [5] explored the classification of general bioscience images into generic categories based on features from both text (image captions) and images. Shatkay et al. [6] described a method to obtain features from images to categorize biomedical documents.

We have built a system called SLIF [7,8] (for Subcellular Location Image Finder) that extracts information about protein subcellular locations from both text and images. SLIF analyzes figures in biological papers, which include both images and captions. In SLIF, a large corpus of articles is fully analyzed and the results of the analysis steps are stored in an SQL database as traceable assertions. An interface to the database (http://slif.cbi.cmu.edu) has been designed such that images and text of interest can be retrieved and presented to users [7].

[Figure 1. A figure-caption pair reproduced from the biomedical literature. The caption reads: "Fig. 5. Double immunofluorescence confocal microscopy using mouse mAb against cPABP and affinity-purified rabbit antibodies against mrnp 41. Methanol-permeabilized and fixed HeLa cells were incubated with affinity-purified rabbit anti-mrnp 41 antibodies (a) and with monoclonal anti-cPAPB antibodies (b), and the bound antibodies were visualized with fluorescently labeled secondary antibodies. (Bar = 10 µm.)"]

In a system mining both text and images, associating the information from the text and the image is very challenging, since there are usually multiple sub-figures in a figure and we must match sub-figures with the sentences in the text.
In the initial version of SLIF, we extracted the labels for the sub-figures and the sentences separately and matched them by finding equal-value pairs. This naive matching approach ignores much of the context information: the labels for sub-figures usually form a sequence of letters, and people assign labels in a particular order rather than randomly. It could only achieve a matching accuracy of 64.3%. To obtain a satisfactory matching accuracy, the naive approach requires high-accuracy image analysis and text analysis to get the labels. However, extracting labels from images is non-trivial. Inferring the label sequences and improving image processing allowed us to increase the F1 for panel label extraction to 78% [9].

In this paper, we introduce a stacked graphical model to match the labels of sub-figures with labels of sentences. The stacked model can take advantage of the context information and achieves an 81.3% accuracy. In the following, we give a brief review of SLIF in Section 2. Section 3 describes the stacked model used for the matching. Section 4 summarizes the experimental results and Section 5 concludes the paper.

2. SLIF Overview

SLIF applies both image analysis and text interpretation to figures. Figure 1 is a typical figure that SLIF can analyse. (This figure is reproduced from the article "mRNA binding protein mrnp 41 localizes to both nucleus and cytoplasm", by Doris Kraemer and Günter Blobel, Cell Biology Vol. 94, pp. 9119-9124, August 1997.)

[Figure 2. Overview of the image and text processing steps in SLIF. Caption processing: caption understanding (Cohen et al., 2003) and entity matching and extraction (Cohen et al., 2003; Kou et al., 2005) produce annotated scopes (proteins, cells) and image pointers. Image processing: panel splitting (Murphy et al., 2001), label finding (Kou et al., 2003), and panel classification and micrograph analysis (Murphy et al., 2001) produce panel labels and annotated panels (image type, image scale, subcellular location features). Panel labels and image pointers are matched (Kou et al., 2003; current work) and the results are stored in the SLIF database (Murphy et al., 2004).]

Figure 2 shows an overview of the steps in the SLIF system, with references to the publications in which they are described in more detail. Image processing includes several steps:

Decomposing images into panels. For images containing multiple panels, the individual panels are recovered from the image.

Identifying fluorescence microscope images. Panels are classified as to whether they are fluorescence microscope images, so that appropriate image processing steps can be performed.

Image preprocessing and feature computation. Firstly, the annotations such as labels, arrows, and indicators of scale contained within the image are detected, analyzed, and then removed from the image. In this step, panel labels are recognized by Optical Character Recognition (OCR). Panel labels are textual labels which appear as annotations to images, for example, "a" and "b" printed in the panels in Figure 1. Recognizing panel labels is very challenging: even after careful image pre-processing and enhancement the F1 accuracy is only about 75%. The OCR results are used as candidate panel labels, and after filtering the candidates an F1 accuracy of 78% is obtained [9]. Secondly, the scale bar is extracted, and finally subcellular location features (SLFs) are produced and the localization pattern of each cell is determined.

Caption processing is done as follows.
Entity name extraction. In the current version of SLIF we use an extractor trained on conditional random fields [10] and an extractor trained on Dictionary-HMMs [11] to extract the protein name. The cell name is extracted using hand-coded rules.

Image pointer extraction. The linkage between the panels and the text of captions is usually based on textual labels which appear as annotations to the images (i.e., panel labels), and which are also interspersed with the caption text. We call these textual labels appearing in text image pointers, for example, "(a)" and "(b)" in the caption in Figure 1. In our analysis, image pointers are classified into four categories according to their linguistic function: bullet-style image pointers, NP-style image pointers, citation-style image pointers, and other [12]. The image-pointer extraction and classification steps are done via a machine learning method [12].

Entity to image pointer alignment. The scope of an image pointer is the section of text (sub-caption) that should be associated with it. The scope is determined by the class assigned to an image pointer [12].

3. A Stacked Model to Map Panel Labels to Image Pointers

3.1. Stacked Graphical Models for Classification

Stacked graphical models are a meta-learning scheme to do collective classification [13], in which a base learner is augmented by expanding one instance's features with predictions on other related instances. Stacked graphical models work well on predicting labels for relational data with graphical structures (Kou and Cohen, in preparation). The inference converges much faster than the traditional Gibbs sampling method, and it has been shown empirically that one iteration of stacking is able to achieve good performance on many tasks. The disadvantage of stacking is that it requires more training time to achieve faster testing inference. Figure 3 shows the inference and learning methods for stacked graphical models. In a stacked graphical model, the relational template C finds the related instances. For an instance x_i, C(x_i) retrieves the indices i1, ..., iL of the instances x_{i1}, ..., x_{iL} that are related to x_i. Given predictions ŷ for a set of instances x, C(x_i, ŷ) returns the predictions on the related instances, i.e., ŷ_{i1}, ..., ŷ_{iL}.

The idea of stacking is to take advantage of the dependencies among instances, or the relevance between inter-related tasks. In our application in this paper, we conjecture that panel label extraction and image pointer extraction are inter-related, and design a stacked model that combines them.

Figure 3. Stacked graphical learning and inference.

- Parameters: a relational template C and a cross-validation parameter J.
- Learning algorithm: given a training set D = {(x, y)} and a base learner A:
  - Learn the local model, i.e., when k = 0: return f^0 = A(D^0). (Note that D^0 = D, x^0 = x, y^0 = y.)
  - Learn the stacked models, for k = 1, ..., K:
    (1) Construct cross-validated predictions ŷ^{k-1} for x in D as follows:
        (a) Split D into J equal-sized disjoint subsets D_1, ..., D_J.
        (b) For j = 1, ..., J, let f_j^{k-1} = A(D^{k-1} - D_j^{k-1}).
        (c) For x in D_j, ŷ^{k-1} = f_j^{k-1}(x^{k-1}).
    (2) Construct an extended dataset D^k = (x^k, y) by converting each instance x_i to x_i^k as follows: x_i^k = (x_i, C(x_i, ŷ^{k-1})), where C(x_i, ŷ^{k-1}) returns the predictions for the examples related to x_i, so that x_i^k = (x_i, ŷ_{i1}^{k-1}, ..., ŷ_{iL}^{k-1}).
    (3) Return f^k = A(D^k).
- Inference algorithm: given x:
  (1) ŷ^0 = f^0(x). For k = 1, ..., K:
  (2) Carry out step (2) above to produce x^k.
  (3) ŷ^k = f^k(x^k).
  Return ŷ^K.
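To make the procedure in Figure 3 concrete, here is a minimal Python sketch of stacked learning and inference. It is not the SLIF implementation: the names (train_stacked, predict_stacked, expand), the scikit-learn-style base learner with fit/predict methods, and the assumption that every instance has the same number of related instances (so the extended feature vectors have a fixed length) are all illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold

def expand(X, y_hat, template):
    # x_i^k = (x_i, predictions on the instances related to x_i)
    return np.array([np.concatenate([X[i], [y_hat[j] for j in template(i)]])
                     for i in range(len(X))])

def train_stacked(X, y, base_learner, template, K=1, J=5):
    """Learning algorithm of Figure 3. base_learner is a factory that
    returns a fresh classifier, e.g. lambda: LogisticRegression()."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    models, Xk = [], X
    for k in range(K + 1):
        model = base_learner().fit(Xk, y)          # f^k = A(D^k)
        models.append(model)
        if k == K:
            break
        # Cross-validated predictions on the level-k representation.
        y_hat = np.zeros(len(y))
        for train_idx, test_idx in KFold(n_splits=J).split(Xk):
            fold_model = base_learner().fit(Xk[train_idx], y[train_idx])
            y_hat[test_idx] = fold_model.predict(Xk[test_idx])
        Xk = expand(X, y_hat, template)            # extended dataset D^(k+1)
    return models

def predict_stacked(X, models, template):
    """Inference algorithm of Figure 3: apply f^0, then repeatedly
    expand the features with the latest predictions and apply f^k."""
    X = np.asarray(X, dtype=float)
    y_hat = models[0].predict(X)
    for model in models[1:]:
        y_hat = model.predict(expand(X, y_hat, template))
    return y_hat
```

For the matching task described in Section 3.2, X would hold the features of the <o_i, p_j> pairs, y their positive/negative labels, base_learner a MaxEnt classifier such as scikit-learn's LogisticRegression, and template(i) would return the indices of the pairs built from neighboring panels.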
3.2. A Stacked Model for Mapping

In the previous version of SLIF, we map panel labels to image pointers by finding the equal-value pair. Below we apply the idea of stacked graphical models to map the panel labels and image pointers.

In SLIF the image pointer finding was done as follows. Most image pointers are parenthesized, and relatively short. We thus hand-coded an extractor that finds all parenthesized expressions that are (a) less than 15 characters long and (b) do not contain a nested parenthesized expression, and replaces X-Y constructs with the equivalent complete sequence (e.g., constructs like "B-D" are replaced with "B,C,D"). We call the image pointers extracted by this hand-coded approach candidate image pointers. The hand-coded extractor has high recall but only moderate precision. Using a classifier trained with machine learning approaches, we then classify the candidate image pointers as bullet-style, citation-style, NP-style, or other. Image pointers classified as "other" are discarded, which compensates for the relatively low precision of the hand-coded extractor [12].
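As an illustration of the kind of heuristic involved, here is a rough Python sketch of such a hand-coded extractor: it keeps parenthesized expressions with no nested parentheses and short content (we guess at most 14 characters inside the parentheses, since the exact length rule is not spelled out), and expands X-Y letter ranges. The function names are ours, and the details certainly differ from the actual SLIF extractor.

```python
import re

def expand_letter_ranges(text):
    """Replace X-Y constructs such as 'B-D' with the full sequence 'B,C,D'."""
    def _expand(match):
        start, end = match.group(1), match.group(2)
        if ord(start) < ord(end):
            return ",".join(chr(c) for c in range(ord(start), ord(end) + 1))
        return match.group(0)
    return re.sub(r"\b([A-Za-z])\s*-\s*([A-Za-z])\b", _expand, text)

def candidate_image_pointers(caption):
    """Find parenthesized expressions with no nested parentheses and at most
    14 characters inside the parentheses; expand letter ranges in each."""
    return [expand_letter_ranges(m.group(1))
            for m in re.finditer(r"\(([^()]{1,14})\)", caption)]

# The caption of Figure 1 yields ['a', 'b', 'Bar = 10 µm.']; the spurious third
# candidate illustrates the moderate precision and is later classified as
# "other" and discarded.
print(candidate_image_pointers(
    "anti-mrnp 41 antibodies (a) and with monoclonal anti-cPAPB antibodies (b), "
    "... (Bar = 10 µm.)"))
```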
In SLIF the panel label extraction was done as follows. Image processing and OCR techniques are applied to find the labels printed within the panels: first, candidate text regions are computed via image processing, and then OCR is run on these candidate regions to get candidate panel labels. This approach has a relatively high precision yet low recall. We call the panel labels recognized by image processing and OCR candidate panel labels. A strategy based on grid analysis (a procedure which determines how many panels there are in a figure and how the panels are arranged) is applied to the candidate panel labels to get a better accuracy [9].

The match between panel labels and image pointers can be formulated as a classification problem. We construct a set of pairs <o_i, p_j> for all candidate panel labels o_i and candidate image pointers p_j from the same figure. That is, for a panel with l_i representing the real label, o_i representing the panel label recognized by OCR, and p_j's representing the image pointers in the same figure, we construct a set of pairs <o_i, p_j>. We label the pair <o_i, p_j> as positive only if l_i = p_j, and otherwise as negative. For example, in Figure 1, the real label l_i for panel a is "a". If OCR recognizes o_i = "a", and the image pointers for the figure are "a" and "b", we construct two pairs, <a, a> labelled as positive and <a, b> labelled as negative. Note that the pair is labelled according to the real label and the image pointers. If OCR recognizes o_i incorrectly for panel a in Figure 1, for example o_i = "o", we have two pairs, <o, a> labelled as positive and <o, b> labelled as negative. We design features based on the o_i's and p_j's.

For the base feature set, there are three binary features: one boolean value indicating whether o_i = p_j; one indicating whether o_{i_left} = p_j - 1 or o_{i_upper} = p_j - 1; and another indicating whether o_{i_right} = p_j + 1 or o_{i_down} = p_j + 1, where i_left is the index of the panel to the left of panel i in the same row, i_upper is the index of the panel above panel i in the same column, p_j + 1 is the letter following p_j, and p_j - 1 is the letter preceding p_j. This feature set takes advantage of the context information by comparing o_{i_left} to p_j - 1, and so on. The second and third features capture the first-order dependency: if a neighboring panel (an adjacent panel in the same row or the same column) is recognized as the corresponding "adjacent" letter, there is a higher chance that o_i is equal to p_j.

In the inference step for the base learner in the stacked model, if a pair <o_i, p_j> is predicted as positive, we set the value of o_i to be p_j, since empirically the image pointer extraction has a higher accuracy than the panel label recognition. That is, the predicted value ô_i is p_j for a positive pair, and ô_i remains o_i for a negative pair. After obtaining ô_i, we recalculate the features by comparing the ô_i's and p_j's. We call this procedure of predicting <o_i, p_j>, updating ô_i, and re-calculating features "stacking". We choose MaxEnt as the base learner to classify <o_i, p_j>, and in our experiments we implement one iteration of stacking.

Besides the basic features, we also include another feature that captures the "second-order context", i.e., the spatial dependency among all the "sibling" panels, even when they are not adjacent. In general the arrangement of labels might be complex: labels may appear outside panels, or several panels may share one label. However, in the majority of cases, panels are grouped into a grid, each panel has its own label, and labels are assigned to panels in either column-major or row-major order. The panels shown in Figure 4 are typical of this case. For such cases, we analyze the locations of the panels in the figure and reconstruct this grid, i.e., the total number of columns and rows, and also determine the row and column position of each panel.

[Figure 4. Second-order dependency: a 3x3 grid of panels labeled a through i in row-major order (first row a, b, c; second row d, e, f; third row g, h, i).]

We compute the second-order feature as follows: for a panel located at row r and column c with label o, if there is another panel located at row r' and column c' (r' != r or c' != c) with label o', and according to either row-major or column-major order the label assigned to panel (r', c') would be o' given that the label for panel (r, c) is o, then we assign 1 to the second-order feature. For example, in Figure 4, recognizing the panel label "a" at row 1, column 1 would help to recognize "e" at row 2, column 2 and "h" at row 3, column 2. With the first-order and second-order features, the chance increases that a missing or mis-recognized label is matched to the correct image pointer.
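To make the feature definitions above concrete, the following Python sketch computes the three base features and the second-order feature for a pair <o_i, p_j>, given the reconstructed grid. The Panel record, the shift helper, and in particular the reading of the second-order feature (checking whether a sibling's recognized label is consistent with p_j being the label of panel i) are our own assumptions about details the text leaves open; the actual SLIF feature code may differ.

```python
from dataclasses import dataclass

@dataclass
class Panel:
    label: str   # candidate panel label o_i from OCR ('' if nothing recognized)
    row: int     # row of the panel in the reconstructed grid (1-based)
    col: int     # column of the panel in the reconstructed grid (1-based)

def shift(letter, offset):
    """Letter at a given offset in the alphabet, e.g. shift('a', 1) == 'b'."""
    return chr(ord(letter) + offset) if letter else ''

def first_order_features(panels, i, p_j):
    """The three binary base features for the pair <o_i, p_j>."""
    o = panels[i]
    def neighbor(dr, dc):
        return next((q for q in panels
                     if q.row == o.row + dr and q.col == o.col + dc), None)
    left, upper = neighbor(0, -1), neighbor(-1, 0)
    right, down = neighbor(0, 1), neighbor(1, 0)
    prev_p, next_p = shift(p_j, -1), shift(p_j, +1)
    return [int(o.label == p_j),
            int((left is not None and left.label == prev_p) or
                (upper is not None and upper.label == prev_p)),
            int((right is not None and right.label == next_p) or
                (down is not None and down.label == next_p))]

def second_order_feature(panels, i, p_j, n_rows, n_cols):
    """1 if the recognized label of some sibling panel is consistent with p_j
    being the label of panel i under row-major or column-major ordering."""
    o = panels[i]
    for q in panels:
        if q is o or not q.label:
            continue
        row_major_offset = (q.row - o.row) * n_cols + (q.col - o.col)
        col_major_offset = (q.col - o.col) * n_rows + (q.row - o.row)
        if q.label in (shift(p_j, row_major_offset), shift(p_j, col_major_offset)):
            return 1
    return 0

# Figure 4 grid (row-major labels), with the OCR label of panel (2,1) missing:
panels = [Panel('a', 1, 1), Panel('', 2, 1), Panel('e', 2, 2)]
print(first_order_features(panels, 1, 'd'))        # [0, 0, 1]: 'e' to the right is p_j + 1
print(second_order_feature(panels, 1, 'd', 3, 3))  # 1: 'a' at (1,1) fits row-major order
```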
4. Experiments

4.1. Dataset

To evaluate the stacked model for panel label and image pointer matching, we collected a dataset of 200 figures containing 1070 sub-figures. This is a random subsample of a larger set of papers from the Proceedings of the National Academy of Sciences. Our current approach can only analyse labels contained within panels (internal labels), due to limitations of the image processing stage, so our dataset only contains figures with internal labels. Though the dataset does not cover all cases, panels with internal labels are the vast majority in our corpus. We hand-labeled all the image pointers in the captions and the label for each panel. The match between image pointers and panels was also assigned manually.

4.2. Baseline Algorithms

The approaches used to find the candidate image pointers and panel labels were described in Section 3.2. In this paper, we take the hand-coded approach and the machine learning approach [12] as the baseline algorithms for image pointer extraction. The OCR-based approach and the grid analysis approach [9] are the baseline algorithms for panel label extraction.

We also compare the stacked model to relational dependency networks (RDNs) [14]. RDNs are an undirected graphical model for relational data. Given a set of entities and the links between them, an RDN defines a full joint probability distribution over the attributes of the entities. Attributes of an object can depend probabilistically on other attributes of the object, as well as on attributes of objects in its relational neighborhood. We build an RDN model as shown in Figure 5. In the RDN model there are two types of entities, image pointer and panel label. For an image pointer, the attribute p_j is the value of the candidate image pointer; for a panel label, o_i is the value of the candidate panel label. p_tru and o_tru are the true values to be predicted. The links L_pre and L_next capture the dependency among the sequence of image pointers: L_pre points to the previous letter and L_next points to the following letter. P_left, P_right, P_upper, and P_down point to the panels to the left, to the right, above, and below, respectively. The RDN model takes the candidate image pointers and panel labels as input and predicts their true values. The match between the panel labels and the image pointers is then done by finding the equal-value pairs.

[Figure 5. The RDN model: image pointer entities (attributes p_j and p_tru, connected by L_pre and L_next) and panel label entities (attributes o_i and o_tru, connected by P_left, P_right, P_upper, and P_down), linked by an "equal" relation.]

4.3. Experimental Results

We used 5-fold cross-validation to evaluate the performance of the stacked graphical model for image pointer to panel label matching. The evaluation was reported in two ways: the performance on the matching, and the performance on image pointer and panel label extraction. The matching is the "real" problem, i.e., what we really care about are the matches, not getting the labels correct. Evaluation of the image pointer and panel label extraction is a secondary check on the learning technique.

Table 1 shows the accuracy of image pointer to panel label matching. For the baseline algorithms, the match was done by finding the equal-value pairs. Baseline algorithm 1 compares the candidate image pointers to the candidate panel labels. Baseline algorithm 2 compares the image pointers extracted by the learning approach to the panel labels obtained after grid analysis. The stacked graphical model takes the same input as Baseline algorithm 1, i.e., the candidate image pointers extracted by the hand-coded algorithm and the candidate panel labels obtained by OCR. We observe that the stacked graphical model improves the accuracy of matching.
Both the first-order dependency and the second-order dependency help to achieve a better performance. The RDN also achieved a better performance than the two baseline algorithms. Our stacked model achieves a better performance than the RDN, because in stacking the dependency is captured and indicated "strongly" by the way we design the features.

Table 1. Accuracy of image pointer to panel label matching.

  Baseline algorithm 1                                48.7%
  Baseline algorithm 2 (current algorithm in SLIF)    64.3%
  RDN                                                 70.8%
  Stacked model (first-order)                         75.1%
  Stacked model (second-order)                        81.3%

That is, the stacked model can treat the matching as a binary classification of <o_i, p_j> and capture the first-order and second-order dependencies directly through our feature definitions. In RDNs, however, the data must be formulated as types of entities described by attributes, and the dependency is modeled with links among attributes. Though RDNs can model the dependency among the data, the matching problem is decomposed into a multi-class classification problem plus a matching procedure. Besides that, the second-order dependency cannot be modeled explicitly in the RDN.

Table 2 shows the performance on the sub-tasks of image pointer extraction and panel label extraction. The results are reported as F1 measures. Since in the stacked model we update the value of o_i and set it to p_j when finding a match, stacking also improves the accuracy of panel label extraction. The accuracy of image pointer extraction remains the same, since we do not update the value of p_j. Baseline algorithm 1 is the approach of finding candidate image pointers or candidate panel labels. Baseline algorithm 2 is the learning approach for image pointer extraction and the grid analysis strategy for panel label extraction. The inputs for the stacked graphical model are the candidate image pointers and candidate panel labels.

Table 2. Performance (F1) on image pointer extraction and panel label extraction.

                                                  Image pointer extraction   Panel label extraction
  Baseline algorithm 1                            60.9%                      52.3%
  Baseline algorithm 2                            89.7%                      65.7%
  RDN                                             85.2%                      73.6%
  Stacked model with first-order dependency       -                          77.8%
  Stacked model with second-order dependency      -                          83.1%

We observe that by updating the value of o_i, we can achieve a better performance on panel label extraction, i.e., provide more "accurate" features for stacking. The RDN also helps to improve the performance, yet the best performance is obtained via stacking.

4.4. Error Analysis

As mentioned in Section 2, OCR on panel labels is very challenging, and baseline algorithm 1 suffers from low recall. Most errors occur when not enough o_i's are recognized by the baseline algorithm to provide information about the first-order and second-order dependencies. Figure 6(a) shows a case where the current OCR fails. Figure 6(b) shows a case where there is not enough contextual information to determine the label for the upper-left panel.

[Figure 6. Cases where the current algorithms fail: (a) a hard case for OCR; (b) a hard case for the stacked model.]

5. Conclusions

In this paper we briefly reviewed the SLIF system, which extracts information on one particular aspect of biology from a combination of text and images in journal articles.
In such a system, associating the information from the text and image requires matching sub-figures in a figure with the sentences in the text. We used a stacked graphical model to match the labels of sub-figures with the labels of sentences. The experimental results show that the stacked graphical model can take advantage of the context information and achieves a significant improvement in matching accuracy as compared with a relational dependency network or the current algorithm in SLIF. In addition to accomplishing the matching at a higher accuracy, the stacked model also helps to improve the performance of finding labels for sub-figures.

The idea of stacking is to take advantage of the context information, or the relevance between inter-related tasks. Future work will focus on applying stacked models to more tasks in SLIF, such as protein name extraction.

Acknowledgments

The work was supported by research grant 017396 from the Commonwealth of Pennsylvania Department of Health, NIH grants K25 DA017357 and R01 GM078622, and grants from the Information Processing Technology Office (IPTO) of the Defense Advanced Research Projects Agency (DARPA).

References

1. B. de Bruijn and J. Martin, Getting to the (c)ore of knowledge: mining biomedical literature. Int. J. Med. Inf. 67 (2002), 7-18.
2. M. Krallinger and A. Valencia, Text-mining and information-retrieval services for molecular biology. Genome Biology 6 (2005), 224.
3. L. Hunter and K. B. Cohen, Biomedical language processing: what's beyond PubMed? Molecular Cell 21 (2006), 589-594.
4. H. Yu and M. Lee, Accessing bioscience images from abstract sentences. Bioinformatics 22(14) (2006), 547-556.
5. B. Rafkind, M. Lee, S.-F. Chang, and H. Yu, Exploring text and image features to classify images in bioscience literature. Proceedings of BioNLP 2006, 73-80.
6. H. Shatkay, N. Chen, and D. Blostein, Integrating image data into biomedical text categorization. Bioinformatics 22(14) (2006), 446-453.
7. R. F. Murphy, Z. Kou, J. Hua, M. Joffe, and W. W. Cohen, Extracting and structuring subcellular location information from on-line journal articles: the Subcellular Location Image Finder. Proceedings of KSCE 2004, 109-114.
8. R. F. Murphy, M. Velliste, J. Yao, and G. Porreca, Searching online journals for fluorescence microscope images depicting protein subcellular locations. Proceedings of BIBE 2001, 119-128.
9. Z. Kou, W. W. Cohen, and R. F. Murphy, Extracting information from text and images for location proteomics. Proceedings of BIOKDD 2003, 2-9.
10. R. McDonald and F. Pereira, Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6(Suppl 1):S6, May 2005.
11. Z. Kou, W. W. Cohen, and R. F. Murphy, High-recall protein entity recognition using a dictionary. Bioinformatics 21(Suppl 1) (2005), 266-273.
12. W. W. Cohen, R. Wang, and R. F. Murphy, Understanding captions in biomedical publications. Proceedings of KDD 2003, 499-504.
13. B. Taskar, P. Abbeel, and D. Koller, Discriminative probabilistic models for relational data. Proceedings of UAI 2002, 485-492.
14. D. Jensen and J. Neville, Dependency networks for relational data. Proceedings of ICDM 2004, 170-177.