Boosting with Incomplete Information Gholamreza Haffari1 Yang Wang1 Shaojun Wang2 Greg Mori1 Feng Jiao3 1 Simon Fraser University, Canada G H A FF A R 1 @ C S . S F U . C A Y WA N G 1 2 @ C S . S F U . C A S H AO J U N . WA N G @ W R I G H T. E D U MORI@CS.SFU.CA F E N G J I AO @ YA H O O - I N C . C O M 2 Wright State University, USA 3 Yahoo! Inc. USA Abstract In real-world machine learning problems, it is very common that part of the input feature vector is incomplete: either not available, missing, or corrupted. In this paper, we present a boosting approach that integrates features with incomplete information and those with complete information to form a strong classifier. By introducing hidden variables to model missing information, we form loss functions that combine fully labeled data with partially labeled data to effectively learn normalized and unnormalized models. The primal problems of the proposed optimization problems with these loss functions are provided to show their close relationship and the motivations behind them. We use auxiliary functions to bound the change of the loss functions and derive explicit parameter update rules for the learning algorithms. We demonstrate encouraging results on two real-world problems -- visual object recognition in computer vision and named entity recognition in natural language processing -- to show the effectiveness of the proposed boosting approach. world problems such as text filtering and routing, ranking, learning in natural language processing, image retrieval, medical diagnosis, and customer monitoring and segmentation (Schapire, 2004). It is very common in real-world machine learning problems that part of the input feature vector is incomplete: either not available, missing, or corrupted. In a web-page ranking problem, for example, using click-though data as part of the features, we find that a small number of valid pages have click features and most do not. In the case of object recognition in computer vision, many approaches assume a part-based model. However, certain parts of the object are hard to detect reliably due to small support in the image, occlusion or clutter, which also lead to missing information. Handling these kinds of classification problems containing incomplete information is a very important and realistic task. Excluding popular EM algorithms for generative models, some methods have been recently proposed for discriminative models (Chechik et al., 2007; Koo & Collins, 2005; Quattoni et al., 2005; Shivaswamy et al., 2006; Bi & Zhang, 2004). In this paper, we show how to handle incomplete data under the boosting approach. We first describe the precise problem we are trying to solve, then we formulate optimization problems where the loss functions consist of two parts, one using partially labeled data and the other using fully labeled data. The primal problems of the proposed optimization problems with these loss functions are provided to show their close relationship and shed light on the rationale behind them. We derive explicit parameter update rules of the learning algorithms by introducing auxiliary functions to bound the change of loss functions. Finally, we demonstrate encouraging results on two real-world problems to show the effectiveness of the proposed boosting approach: visual object recognition in computer vision and named entity recognition in natural language processing. 1. Introduction Boosting is a general supervised learning technique for incrementally building linear combinations of "weak" models to generate a "strong" predicative model. It is one of the most successful and practical methods in machine learning. Over the last decade, it has attracted much attention in the machine learning community and related areas such as statistics. It has been widely applied in many real These authors contributed equally to this work. Appearing in Proceedings of the 25 th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s). Boosting with Incomplete Information 2. Preliminaries Let X X be a random variable over data instances to be labeled, and Y be a random variable over corresponding labels ranging over a finite label alphabet Y . The classification task is to learn a mapping from data instances X to labels fY . Assume we have a set of feature functions w F1 := k (x, y ) here each feature maps X × Y to R. Same as in (Collins et al., 2002; Lebanon & Lafferty, 2002) and without loss of generality, we assume that the range of all feature functions in this paper is [0, 1]. These feature functions correspond to weak learners in boosting and sufficient statistics in an exponential family model. Suppose the target predictor can be derived from a scoring function written af a linear combination of feature s k fk (x, y ). Given a training functions t(x, y ) = k F1 ( , dataset xi , yi ) it has been shown (Lebanon & Lafferty, 2002) that Adaboost (Freund & Schapire, 1997) combines features to minimize the following exponential loss XX xi y 3. Boosting with Hidden Variables The challenge in this paper is, besides using the feature set F1 and training set D1 , how to use the additional feature set F2 and training set D2 to obtain a better approximation for the mapping from instances to labels. To this end, the main object of focus is a mapping from X × H to Y , which is modeled by a conditional probability distribution p (y |x, h). This distribution is called the normalized model and is defined parametrically as ~ ~ p (y |x, h) e 1 ·[f1 (x,y)-f1 (x,yx )]+ 2 ·[f2 (x,h,y)-f2 (x,h,yx )] T T where 1 and 2 are the model's parameter vectors corresponding to features in F1 and F2 , respectively1 . To estimate the parameters of the distribution, we can maximize the conditional likelihood of the training data: L() := X i log p (yi |xi ) + X j log p (yj |xj , hj ) q (y |xi ) (1) i f f k k (x, y ) - fk (x, yx ) ~ where q (y |x) := exp k F1 s called the unnormalized model, and yx denotes the label ~ of instance x over the empirical data. Equivalently, it has been shown (Lebanon & Lafferty, 2002) that Logitboost (Friedman et al., 2000) minimizes the following log loss - X xi where is used to balance the influence of the two data sources on the objective function. Let q0 (h|x) be a fixed distribution representing the prior belief in values of the hidden variable given an instance x, then p (y , h|x) = s q0 (h|x)p (y |x, h) and the firht term in L() can be computed based on p (y |x) = p (y , h|x). We now turn our attention to model the mapping from X × H to Y by a linear scoring function that is the basis of our Adaboost type algorithms. When h is observed, the mapping is defined based on t (x, h, y ) := T · f 1 (x, y ) + T · f 2 (x, h, y ) 1 2 log p (yxi |xi ) ~ (2) where p (y |x) := q (y |x)/Z (x) is called the normalized model. Optimizing the two objective functions above can be done by either parallel or sequential updates (Collins et al., 2002; Lebanon & Lafferty, 2002). Now assume that there is a random variable h H which b ( is hidden in some part of the training data D1 := xi , yi ) ut has been observed in the rest of the training data D2 := ( . xj , hj fyj ) Consider a second set of feature functions , w F2 := k (x, h, y ) here each feature maps X × H × Y to R. In many real-world applications, the number of fully observed instances is much smaller than that of partially observed instances, that is, |D2 | |D1 |, since obtaining fully observed instances is either expensive or timeconsuming. To take full advantage of all available training data, we need to develop new methods, because the information cannot be fully exploited by the original boosting algorithm. Hereafter we use subscripts i and j to range over training data in D1 and D2 respectively. For a datum (x, h, y ), we denote all of its F1 features by the vector f1 (x, y ) and all of its F2 features by the vector f2 (x, h, y ). and when h is hidden, it is defined as t (x, y ) := h q0 (h|x)t (x, h, y ). As before, q0 (h|x) is used to inject prior domain knowledge. To learn the parameters, we pose the minimization of the loss function E () defined as E () := XX i h q0 (h|x) X y q (y |xi , h) + XX j y q (y |xj , hj ) where q (y |xi , h) is called the unnormalized model ~ ~ q (y |x, h) := e 1 ·[f1 (x,y)-f1 (x,yx )]+ 2 ·[f2 (x,h,y)-f2 (x,h,yx )] T T The second term in E () can be thought of as the loss incurred for the j th instance over all possible labels, and the first term as the expected loss for the ith instance. Note that if q0 (h|xj ) puts a point mass on the observed hj for instances in D2 , then E () can be rewritten compactly as E () = 1 xD1 D2 h,y X X q0 (h|x)q (y |x, h) It is equivalent to the more familiar form p (x, h, y ) e by simply removing the constants T T ~ ~ e 1 ·f1 (x,yx )+ 2 ·f2 (x,h,yx ) from the numerator and denominator. T ·f1 (x,y )+ T ·f2 (x,h,y ) 1 2 Boosting with Incomplete Information In the next section, we will show that there is a close relationship between minimizing E () and maximizing the lowerbound () on L(), which is derived based on Jensen's inequality and defined as () := X i,h Theorem 1. The following optimization program: max xD1 D2 h,y X X q0 (h|x)q (yx |x, h) ~ (3) q0 (h|xi ) log p (yi |xi , h) + X j log p (yj |xj , hj ) is the dual of minpS (p,q0 ,F ) K L(p||r) where the ex~ tended K L(p||r) is defined as X x,h By extending q0 to instances in D2 as before, we can write () = xD1 D2 p(x)q0 (h|x) ~ X y i h p(y |x, h) - 1 + r(x, h, y ) p(y |h, x) log r(x, h, y ) X X h q0 (h|x) log p (yx |x, h) ~ ~ and the set S (p, q0 , F ) is defined as n X h i o p(x)Eq0 (h|x)p(y|x,h) f -Ep(y|x) [f ] = 0, f F ~ p M ~ x Furthermore, we will show a close relationship between maximizing L() and minimizing the following lowerbound on E () derived by Jensen's inequality () := XX i y et (xi ,y)-t (xi ,yi ) XX j y + et (xj ,hj ,y)-t (xj ,hj ,yj ) In the test time, depending on whether h is hidden or not, either p (y |x) or p (y |x, h) can be used to determine the class label of a given instance if we use the probabilistic model. Accordingly, for the linear map, either t (x, y ) or t (x, h, y ) can be used. Our definitions of both normalized and unnormalized models are similar to those in (Lebanon & Lafferty, 2002). If we ignore fully labeled data in L(), we get the hidden conditional random field proposed in (Koo & Collins, 2005; Quattoni et al., 2005) by assuming q0 (h|x) to be constant; however, the second term in L() should exist to take advantage of D2 . If we ignore the first term in E (), we get the standard boosting algorithm's loss function; however, the first term is needed to take advantage of the partially observed data D1 . In the next section, we will provide the primal problems for the proposed loss functions to motivate the rationale of optimizing them and show their relationships. We then give sequential and parallel algorithms to optimize E () and L() in section 5. Proof sketch. The key idea in this theorem is the definition ~ of the extended KL divergence and S (p, q0 , F ). Construct the Lagrangian of the dual, which is a constrained optimization problem, take its derivative, and set it to zero. It will give the form of the optimal solution; plug this form back into the Lagrangian, and make the data con~ sistency asyumption (p is the empirical probability distris bution) p(y |x)f (x, y ) = f (x, yx ) for f F1 and ~ y p(y |x)f (x, h, y ) = f (x, h, yx ) for f F2 , we will ~ obtain the optimization problem in (3) . Theorem 2. The following optimization program: max xD1 D2 X X h q0 (h|x) log p (yx |x, h) ~ (4) is the dual of minpS (p,q0 ,F ) K L(p||r) where the ex~ tended K L(p||r) is defined as in Theorem 1, and o n X ~ ~ p(y |x, h) = 1 S (p, q0 , F ) := p S (p, q0 , F )x, h : y 4. Primal and Dual Programs It is well known (Lebanon & Lafferty, 2002) that for standard boosting with no hidden information, the primal optimization problems for Adaboost and Logitboost are the same except for the additional constraints for the latter to ensure a probabilistic model. For our boosting with incomplete information, this relationship does not exist for the original optimization problems themselves, but rather between E () and () which is the lowerbound on L(). Let the set of non-negative measures M := {m : X × H × Y R+ }, and F := F1 F2 . Let r be the reference measure 1; however, it can be any arbitrary measure that generalizes the objective functions introduced in the previous section. The proof of this theorem is similar to that of Theorem 1 and is omitted because of space constraints. As can be seen from the theorems above, the primal optimization problems corresponding to the objective functions E () and () are the same except fy r the additional constraints for the later o one to ensure p(y |x, h) = 1. The extended KL divergence gives the expected discrepancy between p(y |x, h) and the reference measure r(x, h, y ) where the expectation is taken with respect to the distribution p(x)q0 (h|x). ~ Hence minimizing the extended KL subject to the constraints forces p(y |x, h) to become similar to r, or in particular when the reference measure is 1 or constant, to have more entropy. 5. Learning Algorithms Convergence of boosting algorithms has been studied in various ways. Much work has been done to prove the convergence in terms of an optimization method, which can be categorized into two approaches: greedy function optimization and greedy feature induction. In the first approach, the boosting algorithm is viewed as a sequential gradient descent algorithm (Breiman, 1999; Boosting with Incomplete Information Algorithm 1 Parallel Updates for the Normalized Model 1: repeat 2: for fk F1 do X+ 3: A+ = Ep (y|xi ) [gk (xi , y )]+ k i X+ Ep (y|xj ,hj ) [gk (xj , y )] 4: A- k = E+ (y|xi ) [-gk (xi , y )]+ p i X+ Ep (y|xj ,hj ) [-gk (xj , y )] j l og A - - l og A + j X lution is proved. In this paper we take the second approach to learn the discriminative model. We construct an auxiliary function to bound the change of exponential loss, E ( + ) - E () or log-loss L() - L( + ). Similar to (Collins et al., 2002; Lebanon & Lafferty, 2002), either parallel or sequential updates can be used. By the same argument as in (Collins et al., 2002; Lebanon & Lafferty, 2002), we can show the convergence of these updates to a local minimum of the loss function. For simplicity in presenting the results, we introduce some notation for x D1 D2 : ~ fk F1 , gk (x, y ) = fk (x, y ) - fk (x, yx ) ~ ~ ~ ~~ fk F2 , gk (x, h, y ) = fk (x, h, y ) - fk (x, h, yx ) ~ ~ ~ ~~ f f C := max |gk (x, y )| + ~ |gk (x, h, y )| ~ x,y ,h ~ k F1 k F2 5: 6: 7: 8: k k k = 2C end for for fk F2 do X+ A+ = Ep (y,h|xi ) [gk (xi , h, y )]+ k i X+ Ep (y|xj ,hj ) [gk (xj , hj , y )] 9: - Ak = E+ (y,h|xi ) [-gk (xi , h, y )]+ p i X+ Ep (y|xj ,hj ) [-gk (xj , hj , y )] i l og A - - l og A + k j X (5) (6) ( 7) (8) E+ t) [ (t)] p( := t p(t) (t) : (t)>0 k 10: k = 2C 11: end for 12: for fk F1 F2 do 13: k k + k 14: end for 15: until convergence For the normalized model, the learning algorithm with parallel updates is summarized in Algorithm 1 and with the sequential updates in Algorithm 2. For the unnormalized model, the update rules (parallel or sequential) are exactly the same; the only difference is that we will use q (y |x, h) rather than p (y |x, h) in all the algorithms' equations. For details of the derivation of updating rules in the learning algorithms, see Appendix A. For ease of presentation, we have assumed that the potentially missing attributes are always the same. This is an interesting and nontrivial situation that occurs in many realworld applications, where the missing attribute h is the information that requires expensive human labeling (see the experiments for example applications). However, our approach can be easily extended to the cases where the data could have different missing attributes. In this more general setting, the i-th training datum has the form (xi , yi ) with missing information hi Hi , where Hi can vary for different i's depending on which information is missing. The contribution of this datum to the log loss in the normalized model is simply - log p (yi |xi ). All the arguments in this paper will go through with some minor changes. Friedman et al., 2000; Mason et al., 2000) in function space, inspired by numerical optimization and statistical estimation. It is a forward stage-wise additive modeling that approximates the solution by sequentially adding new basis functions without adjusting the parameters and coefficients of those that have already been added. At each iteration, one solves for the optimal basis function and corresponding coefficients to add to the current expansion. This produces new expansion, and the process is repeated. In the second approach (Collins et al., 2002; Lebanon & Lafferty, 2002), the boosting algorithm is described as a greedy feature induction algorithm to incrementally build random fields. The greediness of the algorithm arises in steps that select the most informative feature. In these steps each feature in a pool of candidate features is evaluated by estimating the reduction in the Kullback-Lieber divergence that would result from adding the feature to the field. This reduction is approximated as a function of a single parameter and is equal to the exponential loss reduction or log loss increment. This approximation is one of the key elements that make it practical to evaluate a large number of candidate features at each stage of the induction algorithm. Various parameter update rules can be derived By using an auxiliary function to bound the change of loss function from above, and thus convergence to the global optimal so- 6. Experiments We evaluate our approach in two real-world problems: visual object recognition in computer vision and named entity recognition in natural language processing. In both cases, we use simple and independent features, so when we calculate the values of A+ and A- , feature expectations j j can be done efficiently. For simplicity, we set to be 1. In practice, this parameter can be set by cross-validation. We set our prior belief in values of the hidden variable given an Boosting with Incomplete Information Algorithm 2 Sequential Updates for the Norm. Model 1: repeat 2: for fk F1 dX Xo 3: A+ = p (y |xi )(1 + gk (xi , y ))+ k 4: A- = k iy X =yi X j p (y |xj , hj )(1 + gk (xj , y )) y =yj XX i p (y |xi )(1 - gk (xi , y ))+ p (y |xj , hj )(1 - gk (xj , y )) y =yi y =yj 5: 6: 7: 8: + k k end for for fk F2 dX X Xo p (y , h|xi )(1 + gk (xi , h, y ))+ A+ = k l og A - - l og A + k k 2 PP j 9: i =y P yPi h j y=yj p (y |xj , hj )(1 + gk (xj , hj , y )) XXX - p (y , h|xi )(1 - gk (xi , h, y ))+ Ak = 10: k 11: end for 12: until convergence l og A - - l og A + k k 2 i =y P yPi j h y =yj p (y |xj , hj )(1 - gk (xj , hj , y )) + k menting the objects (Winn & Shotton, 2006), to specifying a bounding box of the objects (Viola & Jones, 2001), to only indicating the existence of the objects (Fergus et al., 2003). Naturally, there is a trade-off among different levels of supervisions. Manually segmenting the object of interest in an image obviously provides very accurate information for any learning algorithm, but it is very expensive and time-consuming to annotate a large number of images. On the other hand, it is relatively easy to label an image based only on the existence of an object. In our experiment, we assume we have two sets of training images. The first set of images has only class labels associated with them; we represent them as (x, y ), where x refers to the image and y refers to its class label. The second set of images has both class label and the contour of the object being manually labeled; we represent them as (x, h, y ), where h is the information about the contour of the object. Our learning problem is then in precisely the scenario in which our proposed method is expected to be effective. We first run an interest-point detector (Kadir & Brady, 2001) to identity regions of interest on each image. Each interest point is represented by a SIFT descriptor (Lowe, 2004) as a 128-dimensional vector. The SIFT descriptors from all the training images are then vector quantized into K visual words (we choose K = 200 in our experiment) by k -means clustering. All the images are then represented by a bag-of-words representation by counting the occurrence of each visual word in an image. We denote an image as x = (x1 , x2 , ..., xt ), where t is the number of interest points in x, and each xi is an entry to a visual word. The information h about the object contour is represented as h = (h1 , h2 , ..., ht ), where hi is a binary value indicating whether xi is on the object or not. Since we assume the "bag-of-words" model, the summation over h required for calculating A+ and A- can be solved efficiently by j j factoring out the contribution of each interest point. Although bag-of-words representation ignores a lot of positional information between features, previous work (Sivic et al., 2005; Fergus et al., 2005) has demonstrated that it to be quite effective in object recognition tasks. We define the following three sets of features for our boosting algorithm, based on the bag-of-words representation of images. (1) feature fj y (x, y ) is calculated as the count of visual words j in an image x if y = y , and zero otherwise; (2) feature oj y (x, h, y ) is the count of visual words j on the foreground of image x if y = y , and zero otherwise; (3) feature bj y (x, h, y ) is the count of visual words j on the background of image x if y = y , and zero otherwise. Notice that features fj y are always observed for a training image. Features oj y and bj y are observed only when a training image does not have missing information (i.e., the manually labeled object contour). We normalize all the features by the total number of interest points in an image instance, q0 (h|x) to be constant. In all the experiments, we use parallel updates. We have tried sequential updates and find that they are much slower. Although they can achieve higher likelihood on the training data, the results on the test data remain the same. We compare our proposed boosting approach with three different baseline algorithms, in both normalized and unnormalized cases. The first baseline algorithm (BL1) uses both sets of features F1 and F2 , but is trained only on the fully observed training data D2 . The second baseline algorithm (BL2) is trained on all the training data D1 D2 but uses only features F1 , that is, it ignores features F2 that involve the hidden information h. Notice that the second baseline algorithm is identical to the algorithm in (Lebanon & Lafferty, 2002). The third baseline algorithm (BL3) uses all the training data D1 D2 and both types of features F1 F2 but ignores observed h on fully observed data; that is, it assumes all the data are in the form of {(xi , yi )}. Notice that the third baseline algorithm is similar to the hidden conditional random field (Quattoni et al., 2005). 6.1. Visual Object Recognition We first consider a visual object recognition task where some of the data have missing features. In this task, we attempt to classify an image based on the existence of an object of interest in the image. We test our approach on the Caltech 4 dataset: airplanes, cars, faces, and motorbikes. Common approaches to object recognition involve some form of supervision, which may range from manually seg- Boosting with Incomplete Information accuracy log-likelihood Our method 97.22% -0.0916 BL1 89.26% -1.1417 BL2 88.01% -0.5698 BL3 90.43% -0.4375 normalized model accuracy log of loss Our method 94.83% -0.7412 BL1 82.57% -1.1231 BL2 89.86% -0.7977 BL3 87.64% -0.8068 unnormalized model Table 1. Results of our approach on visual object recognition, compared with three baseline algorithms dictions represent its local context, which then used by the classifier. The part-of-speech tag is a valuable source of information and is not available in some annotations of the data sets for this task, so we treat it as the hidden variable that is not observed for some portion of the training data. We could use the sequence of POS tags of the words in the current window as the hidden variable. In that case, we may use a finite state automata to characterize the eligible sequence of POS tags when we want to sum over their values to speed up the training algorithms. The features that we used are summarized in Table 6.2; they are described in more details in (Carreras et al., 2003). Feature Lexical Syntactic Orthographic Affixes Left predict Explanation word forms and their positions in the window part-of-speech tags (when available) capitalized, include digits, ... the suffixes and prefixes (up to four characters) predicted labels for the two previous words to make sure their values are between 0 and 1. During testing, we observed the image x, and we try to infer its label y based on the learned model. Although we can also infer y assuming both x and h are observed during testing, it is actually an unrealistic setting in our application. It requires a perfect figure/ground segmentation of the image x. However, since figure/ground segmentation is itself a very challenging problem in computer vision, it is not reasonable to assume we could have this information during the testing. So we do not investigate this case. Our dataset contains more than 2000 images. We randomly split them equally into training and testing sets. We choose 30% of the training images to be fully observed and the rest to be partially observed. We compare both normalized and unnormalized models with the three baseline algorithms defined above, in terms of classification accuracy and the log-likelihood of the test data. The results are shown in Table 1. We also visualize the most discriminative patches in some sample images in Figure 1. We find that our approach is significantly superior to the three baseline algorithms, in term of both accuracy and log-likelihood on the test images. 6.2. Named Entity Recognition Named entity extraction (NEE) is a subtask of information extraction in which we try to identify names of persons, locations, and organizations in a given set of documents. One approach to this problem is to do first named entity recognition (NER) and then named entity classification (NEC). In this section we apply our method to the NER problem and demonstrate its effectiveness compared to the baseline systems. We consider NER as a sequence labeling problem, that is, specifying a sequence of zero and one for a sentence to classify a word as part of a named entity or not. For each word w, its surrounding words in a window of length 5, its part-of-speech tag (when available), and previous pre- Table 2. Details of the features used for the NER task. Syntactic features belong to F2 and the rest of features belong to F1 . We use the data set of the CONLL 2003 shared task. To reduce the training time, we collapse the original 45 different POS tags into five tags as done in (McCallum et al., 2003). After training the model, we do the classification for each individual position by normalizing the prediction score of the model using the class mass normalization (CMN) procedure as introduced in (Zhu et al., 2003). We compare our approach to the three baseline systems defined before. There are 5K sentences in D1 , 6K sentences in D2 , and 1K sentences in the test set. The first set of experiments show the performance of our model compared to the baselines when, at the test time, only x is available (see Table 3). In the second set of experiments, (x, h) is given at the test time (see Table 4); for this setting, BL2 and BL3 cannot be used. Our method outperforms baseline systems in both sets of experiments in terms of f-measure and log-likelihood or loss function. 7. Comparison to the Related Work Originally boosting is considered as a way to boost weak learners to strong learners by: learning weak hypotheses to classify hard examples in each round, and finally combining these weak hypotheses. Another view to boosting is through the statistical perspective which interprets it as: optimizing some objective function via parallel or sequential updates to determine the weights of all possible weak hypotheses (aka features). There is a debate between the statistic and algorithmic perspective; see (Mease & Wyner, 2008) for more information. Our work takes the statistical perspective and do not engage in that debate. Boosting with Incomplete Information Figure 1. Visualization of the most discriminative patches in each image. Our method BL1 BL2 BL3 f-measure log-likelihood 49.45% -0.5784 46.63% -0.5932 48.10% -0.5803 47.80% -0.5880 normalized model f-measure log of loss Our method 49.04% -2.6337 BL1 46.24% -2.6458 BL2 47.58% -2.6378 BL3 46.39% -2.6434 unnormalized model 8. Conclusions and Further Work In this work we have presented a novel boosting approach that extends the traditional boosting framework by incorporating hidden variables such that fully labeled data can be integrated with partially labeled data to form a powerful strong classifier. Thus, compared with both the original boosting algorithms and hidden CRF, our model performs better in two real-world problems by fully exploiting relevant complete information of data resources. We consider only simple independent features in our model. In fact, the hidden variables may have complex dependencies that respect certain cyclic graph structure; then it may be necessary to use variational methods, such as loopy belief propagation, to compute feature expectation for the values of A+ and A- . As future work, we would like to incorporate more complex dependent features in these two applications. Table 3. Results of our approach on the NER task, compared with three baseline algorithms when only x is given in the test data. f-measure log-likelihood Our method 59.60% -0.5759 BL1 56.51% -0.5916 normalized model f-measure log of loss Our method 60.17% -0.2586 BL1 55.46% -0.2655 unnormalized model Table 4. Results of our approach on the NER task, compared with the baseline algorithm BL1 when (x, h) is given in the test data. Even by having h, namely POS tags, the NER task is not easy. Acknowledgment SW is supported in part by the Research Challenge Grant at Wright State University. We would like to thank Anoop Sarkar, David Blei and anonymous reviewers for their constructive comments. References Bi, J., & Zhang, T. (2004). Support vector classification with input data uncertainty. NIPS. Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation. A related algorithm that takes the first perspective to boosting is AdaBoost with confidence-rated predictions in which a weak learner outputs a real value representing the confidence level (Schapire & Singer, 1999). When provided with incomplete input, the weak learner's contribution is its uncertainty about its vote, which is represented by the produced real number. The details of the connection between our approach and confidence-rated AdaBoost will be an interesting topic to explore for future research. ` ´ Carreras, X., Marquez, L., & Padro, L. (2003). A simple named entity extractor using AdaBoost. Proceedings of CoNLL. Chechik, G., Heitz, G., Elidan, G., Abbeel, P., & Koller, D. (2007). Max-margin classification of incomplete data. NIPS. Collins, M., Schapire, R. E., & Singer, Y. (2002). Logistic regression, adaboost and bregman distances. Machine Learning. Fergus, R., Fei-Fei, L., Perona, P., & Zisserman, A. (2005). Learning object categories from google's image search. IEEE International Conference on Computer Vision. Boosting with Incomplete Information Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. CVPR. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139. Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28, 337­407. Kadir, T., & Brady, M. (2001). Scale, saliency and image description. International Journal of Computer Vision, 45, 83­105. Koo, T., & Collins, M. (2005). Hidden-variable models for discriminative reranking. Proceedings of EMNLP. Lebanon, G., & Lafferty, J. D. (2002). Boosting and maximum likelihood for exponential models. NIPS. Lowe, D. G. (2004). Distinctive image features from scaleinvariant keypoints. International Journal of Computer Vision. Mason, L., Baxter, J., Bartlett, P., & Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in large margin classifiers, 221­246. MIT Press. McCallum, A., Rohanimanesh, K., & Sutton, C. (2003). Dynamic conditional random fields for jointly labeling multiple sequences. NIPS Workshop on Syntax, Semantics, Statistics. Mease, D., & Wyner, A. (2008). Evidence contrary to the statistical view of boosting. JMLR, 9. Quattoni, A., Collins, M., & Darrell, T. (2005). Conditional random fields for object recognition. NIPS. Schapire, R. (2004). The boosting approach to machine learning: An overview. In Nonlinear estimation and classification. Springer. Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297­336. Shivaswamy, P. K., Bhattacharyya, C., & Smola, A. J. (2006). Second order cone programming approaches for handling missing and uncertain data. JMLR, 7. Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A., & Freeman, W. T. (2005). Discovering objects and their location in images. IEEE International Conference on Computer Vision. Viola, P., & Jones, M. (2001). Robust real-time object detection. Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling. Winn, J., & Shotton, J. (2006). The layout consistent random field for recognizing and segmenting partially occluded objects. CVPR. Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using gaussian fields and harmonic functions. ICML. Appendix A. Deriving Learning Algorithms Exponential loss We derive parallel updates for exponential loss. Let + be the new parameters value. We find an upper-bound to the change of objective function E ( + ) - E () by an auxiliary function, and then minimize the bound. E ( + ) - E () = " " X q0 (h|xi )e.G(xi ,h,y) e.G(xi ,h,y) - 1 + i,h,y X j,y " " e.G(xj ,hj ,y) e.G(xj ,hj ,y) - 1 " " P g esk (xi ,h,y)C kk + wxi ,h,y - 1 + q (h, y |xi ) C i,h,y P sk (xj ,hj ,y )C " " X k gk e q (y |xj , hj ) + wxj ,hj ,y - 1 C j,y X := A(, ) k |gk (x,h,y)| , G(x, h, y ) is a vecwhere wx,y,h = 1 - C tor built from gk (x, h, y ), and sk (x, h, y ) is the sign of gk (x, h, y ). We find the stationary point of the auxiliary function A(, ) with respect to k by taking the derivative and setting it to zero, which gives us the updating rules. The sequential updates can also be derived similarly. Log loss The objective is to minimize the objective function -L( + ) + L(). First we find an upper-bound on the objective function: L() - L( + ) = X X +·G(x ,h,y) X ·G(x ,h,y) i i log e - log - e i y ,h y ,h X j log X y e +·G(xj ,hj ,y ) - log X y e·G(xj ,hj ,y) = X i log X y ,h p (h, y |x)e·G(xi ,h,y) + p (y |xj , hj )e·G(xj ,hj ,y) X j log X y XX i y ,h p (h, y |x)e·G(xi ,h,y) p (y |xj , hj )e·G(xj ,hj ,y) + XX j y The inequality holds because log x x. The last expression can be upper-bounded again (using a similar technique used for exponential loss), and the resultant upper-bound will be the auxiliary function. It can be shown that the update rules are the same to unnormalized model, but the difference is to use p (·) rather than q (·). The update rules for sequential updates can be derived similarly.