GRIFT: A graphical model for inferring visual classification features from human data

Michael G. Ross, Psychology Department, University of Massachusetts Amherst, Amherst, MA 01003, mgross@mit.edu
Andrew L. Cohen, Psychology Department, University of Massachusetts Amherst, Amherst, MA 01003, acohen@psych.umass.edu

Abstract

This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects' performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it models classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than has been previously possible with other methods and provides a foundation for further exploration.

1 Introduction

Although a great deal is known about the low-level features computed by the human visual system, determining the information used to make high-level visual classifications is an active area of research. When a person distinguishes between two faces, for example, what image regions are most salient? Since the early 1970s, one of the most important research tools for answering such questions has been the classification image (or reverse correlation) algorithm [1]. This paper describes a new approach, GRIFT (GRaphical models for Inferring Feature Templates), which extends the traditional classification image method. While classification images represent human visual discrimination as a single linear classifier, GRIFT models it as the non-linear combination of multiple independently computed feature templates. This allows GRIFT to extract more detailed information about human classification. Additionally, utilizing a Bayes net representation provides a principled method for applying prior knowledge to the problem of human feature inference.
This paper describes GRIFT and the algorithms for applying it to experimental data. It also demonstrates the model's efficacy on simulated and human data and concludes with a discussion of future research directions.

2 Related work

Ahumada's classification image algorithm [1] models an observer's classifications of visual stimuli with a noisy linear classifier -- a fixed set of weights and a normally distributed threshold. The random threshold accounts for the fact that observers do not always classify a stimulus consistently when it is presented multiple times. In a typical classification image experiment, subjects are presented with hundreds or thousands of noise-corrupted examples from two categories and asked to classify each one. The noise ensures that the samples cover a large volume of the sample space in order to allow recovery of a unique linear classifier that best explains the data. More recently, Gosselin and Schyns developed the Bubbles method [8], which attempts to localize classification-salient image regions by applying spatial and frequency masks to the stimuli the subjects classify.

Figure 1: Left: The GRIFT model is a Bayes net that describes classification as the result of combining N feature detectors. Right: Targets and sample stimuli from the three experiments.

Although classification images are useful in many domains, it is well established that there are domains in which recognition and classification are the result of combining the detection of parts or features, rather than applying a single linear template. For example, Pelli et al. [11] have convincingly demonstrated that humans recognize noisy word images by parts even though better performance can be achieved by integrating information from the entire word. Similarly, Gold et al.
[7] verified that subjects employed feature-based classification strategies for some simple artificial image classes. GRIFT takes the next step and infers features that predict human performance directly from classification data. Most work on modeling non-linear, feature-based classification in humans has focused on verifying the use of a predefined set of features. Recent work by Cohen et al. [4] demonstrates that Gaussian mixture models can be used to recover features from human classification data without specifying a fixed set of possible features. The GRIFT model, described in the remainder of this paper, has the same goals as the previous work, but removes several limitations of the Gaussian mixture model approach, including the need to use only stimuli the subjects classified with high confidence and the bias that the signals can exert on the recovered features. GRIFT achieves these and other improvements by generatively modeling the entire classification process with a graphical model. Furthermore, the strong similarity between single-feature GRIFT models and the classification image process, which is described in more detail below, makes GRIFT a natural generalization. Many visual models employ features that are more general than the linear templates described in this paper. Potential generalizations of the GRIFT feature model are described in the conclusion.

3 GRIFT model

GRIFT models classification as the result of combining N conditionally independent feature detectors, F = {F_1, F_2, ..., F_N}. Each feature detector is binary valued (1 indicates detection), as is the classification, C (1 indicates one class and 2 the other). The stimulus, S, is an array of continuously valued pixels representing the input image. The stimulus only influences C through the feature detectors, therefore the joint probability of a stimulus and classification pair is

P(C, S) = Σ_F P(C|F) P(S) ∏_{i=1}^{N} P(F_i|S).
Therefore, these variables form the Bayesian network model shown in Figure 1. The boxed region in the figure indicates the parts of the model that are replicated when N > 1 -- each feature is represented by an independent copy of those variables and parameters.

The distribution of the stimulus, P(S), is under the control of the experimenter. Fitting the GRIFT model to experimental data relies only on the assumption that across trials the stimuli are independent and identically distributed. The conditional distribution of each feature detector's value, P(F_i|S), is modeled with a sigmoid function on the pixel values of S. The sigmoid is desirable because it is a probabilistic linear classifier. Humans are able to successfully classify images in the presence of extremely high additive noise, which suggests the use of averaging and contrast, linear computations that are known to play important roles in human visual perception [10]. Just as the classification image uses a random threshold to represent uncertainty in the output of its single linear classifier, the sigmoid allows GRIFT to represent uncertainty in the output of each of its N features. The conditional distribution of C is represented by a sigmoid on the feature outputs. Each feature distribution is governed by two parameters, a weight vector ω_i and a threshold θ_i, such that

P(F_i = 1 | S, ω_i, θ_i) = (1 + exp(θ_i + Σ_{j=1}^{|S|} ω_{ij} S_j))^{-1},

where |S| is the number of pixels in a stimulus. Similarly, the conditional distribution of C is determined by λ = {λ_0, λ_1, ..., λ_N}, where

P(C = 1 | F, λ) = (1 + exp(λ_0 + Σ_{i=1}^{N} λ_i F_i))^{-1}.

Detecting a feature with negative λ_i increases the probability that the subject will respond "class 1"; features with positive λ_i are associated with "class 2" responses. Even for small images and few features, this model has many parameters: each ω_i is a vector with as many dimensions as there are pixels in S, and each feature also contributes a θ_i and λ_i parameter.
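As a concrete illustration, the two sigmoid distributions above translate directly into a few lines of Python. This is a sketch using NumPy; the names p_feature_active, p_class_1, omega_i, theta_i, and lam are our own and not part of any released GRIFT code:

```python
import numpy as np

def p_feature_active(S, omega_i, theta_i):
    """P(F_i = 1 | S, omega_i, theta_i): sigmoid linear classifier on pixels.

    S       : 1-D array of stimulus pixel values.
    omega_i : weight vector, one weight per pixel.
    theta_i : scalar threshold.
    """
    return 1.0 / (1.0 + np.exp(theta_i + omega_i @ S))

def p_class_1(F, lam):
    """P(C = 1 | F, lambda): sigmoid on the binary feature outputs.

    F   : 1-D array of N binary feature detections (0 or 1).
    lam : array [lambda_0, lambda_1, ..., lambda_N].
    """
    return 1.0 / (1.0 + np.exp(lam[0] + lam[1:] @ F))
```

For example, a single-feature model with lambda_0 = 0 and lambda_1 = -2 gives p_class_1 of about 0.88 when the feature is active, matching the 88% figure quoted in the prior discussion below.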
The fact that the F variables are hidden also substantially increases the challenge of fitting this model to data. Therefore, GRIFT also defines prior distributions on its parameters. These priors reflect reasonable assumptions about the parameter values and, if they are wrong, can be overturned if enough contrary data is available. The prior on the λ parameters is a mixture of two normal distributions,

P(λ_i) = (2π)^{-1/2} (exp(-(λ_i + 2)² / 2) + exp(-(λ_i - 2)² / 2)) / 2.

This reflects the assumption that each feature should have a significant impact on the classification, but no single feature should make it deterministic -- a single-feature model with λ_0 = 0 and λ_1 = -2 has an 88% chance of choosing class 1 if the feature is active. The θ parameters have an improper non-informative prior, P(θ_i) = 1, indicating no preference for any particular values [5], because the best θ_i for a feature is largely determined by its ω_i and the distribution of S. The ω_i parameters, which each have dimensionality equal to the stimulus, present the biggest inferential challenge. Because useful features often compare the relative brightness of two image regions, or perform local averaging to suppress noise, it is reasonable to assume that the components of ω_i that correspond to adjoining image pixel locations should be similar. To encourage this, the ω parameters have a Markov random field prior distribution:

P(ω_i) ∝ exp(- Σ_{(j,k)∈A} (ω_{ij} - ω_{ik})² / 2),

where A is the set of neighboring pixel locations in the stimulus. For the purpose of fitting the model, there is no need to normalize this distribution by computing its partition function. Combining all of the parameters and variables into a single Bayesian probability distribution, the model is described by:

P(C, F, S, ω, θ, λ) = P(C|F, λ) P(S) P(λ_0) ∏_{i=1}^{N} P(F_i|S, ω_i, θ_i) P(ω_i) P(θ_i) P(λ_i).
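The two informative priors above can likewise be sketched as log-density functions (Python/NumPy; representing the neighbor set A as a list of index pairs is an assumption of this sketch, and the MRF term is deliberately left unnormalized, as the text notes):

```python
import numpy as np

def log_prior_lambda(lam_i):
    """Log of the mixture-of-two-Gaussians prior on a classifier weight,
    with unit-variance modes at -2 and +2 and equal mixing proportions."""
    norm = 1.0 / np.sqrt(2.0 * np.pi)
    return np.log(0.5 * norm * (np.exp(-(lam_i + 2.0) ** 2 / 2.0)
                                + np.exp(-(lam_i - 2.0) ** 2 / 2.0)))

def log_prior_omega(omega_i, neighbors):
    """Unnormalized Markov random field log-prior on a feature's weights.

    omega_i   : 1-D array of pixel weights.
    neighbors : list of (j, k) index pairs of adjacent pixel locations (A).
    The partition function is a constant and can be ignored during fitting.
    """
    return -sum((omega_i[j] - omega_i[k]) ** 2 / 2.0 for j, k in neighbors)
```

The MRF term rewards smooth weight maps: a map that changes value once along a strip scores higher than one that alternates at every pixel.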
(1)

4 GRIFT algorithm

The goal of the algorithm is to find the parameters that satisfy the prior distributions and best account for the (S, C) samples gathered from a human subject. Mathematically, this corresponds to finding the mode of P(ω, θ, λ | S, C), where S and C refer to all of the observed samples. The algorithm is derived from the expectation-maximization (EM) method, a widely used optimization technique for dealing with hidden variables [3], in this case F, the feature detector outputs for all the trials. In order to determine the most probable parameter assignments, the algorithm chooses random initial parameters φ' = (ω', θ', λ') and then finds the φ that maximizes

Q(φ | φ') = Σ_F P(F | S, C, φ') log P(C, F, S | φ) + log P(φ).

The φ that maximizes Q then becomes φ' for the next iteration, and the process is repeated until convergence. The presence of both the P(C, F, S | φ) and P(φ) terms encourages the algorithm to find parameters that explain the data and match the assumptions encoded in the parameter prior distributions. As the amount of available data increases, the relative influence of the priors decreases, so it is possible to discover features that are contrary to prior belief given enough evidence for their existence. Using the conditional independences from the Bayes net,

Q(φ | φ') ∝ Σ_F P(F | S, C, φ') [log P(C | F, λ) + Σ_{i=1}^{N} log P(F_i | S, ω_i, θ_i)] + log P(λ_0) + Σ_{i=1}^{N} (log P(λ_i) + log P(ω_i)),

dropping the log P(S) term, which is independent of the parameters, and the log P(θ_i) terms, which are 0. As mentioned before, the normalization terms for the log P(ω_i) elements can be ignored during optimization -- the log makes them additive constants to Q. The functional form of every additive term is described in Section 3, and P(F | S, C, φ') can be calculated using the model's joint probability function (Equation 1). Each iteration of EM requires maximizing Q, but it is not possible to compute the maximizing φ in closed form. Fortunately, it is relatively easy to search for the best φ.
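Because the features are binary, the E-step quantity P(F | S, C, φ') for a single trial can be computed exactly by enumerating all 2^N feature configurations, which is feasible for the small N used in the experiments below. A minimal sketch (Python/NumPy; the function and variable names are ours, not from the original implementation):

```python
import itertools
import numpy as np

def sigmoid(x):
    """The model's sigmoid: 1 / (1 + exp(x))."""
    return 1.0 / (1.0 + np.exp(x))

def posterior_over_features(S, c, omegas, thetas, lam):
    """E-step for one trial: P(F | S, C=c, parameters), exactly, by
    enumerating all 2^N binary feature configurations.

    S      : 1-D array of stimulus pixels.
    c      : observed classification, 1 or 2.
    omegas : (N, |S|) array of feature weight vectors.
    thetas : (N,) array of feature thresholds.
    lam    : (N+1,) array [lambda_0, lambda_1, ..., lambda_N].
    Returns (configs, posterior) where configs[k] is the k-th binary
    assignment to F and posterior[k] is its normalized probability.
    """
    N = len(thetas)
    p_f1 = sigmoid(thetas + omegas @ S)            # P(F_i = 1 | S) per feature
    configs = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
    p_c1 = sigmoid(lam[0] + configs @ lam[1:])     # P(C = 1 | F) per config
    p_c = p_c1 if c == 1 else 1.0 - p_c1
    p_f = np.prod(np.where(configs == 1, p_f1, 1.0 - p_f1), axis=1)
    joint = p_c * p_f                              # P(C = c, F | S), up to P(S)
    return configs, joint / joint.sum()
```

Summing this posterior against the log terms of Q over all trials yields the objective that the M-step then maximizes.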
Because Q is separable into many additive components, it is possible to efficiently compute its gradient with respect to each of the elements of φ and use this information to find a locally maximal assignment using the scaled conjugate gradient algorithm [2]. Even a locally maximal value of φ usually provides good EM results -- P(ω, θ, λ | S, C) is still guaranteed to improve after every iteration. The result of any EM procedure is only guaranteed to be a locally optimal answer, and finding the globally optimal φ is made more challenging by the large number of parameters. GRIFT adopts the standard solution of running EM many times, each instance starting with a random φ', and then accepting the φ from the run that produced the most probable parameters. For this model and the data presented in the following sections, 20-30 random restarts were sufficient.

5 Experiments

The GRIFT model was fit to data from three experiments. In each experiment, subjects classified stimuli into two classes. Each class contained one or more target stimuli. In each trial, the subject saw a stimulus (a sample from S) that consisted of a randomly chosen target with high levels of independent, identically distributed noise added to each pixel. The noise samples were drawn from a truncated Gaussian distribution to ensure that the stimulus pixel values remained within the display's output range. Figure 1 shows the classes and targets from each experiment and a sample stimulus from each class. The four-square experiment asked subjects to distinguish between two artificial stimulus classes, one in which there were bright squares in the upper-left or upper-right corners and one in which there were bright squares in the lower-left or lower-right corners. The light-dark experiment asked subjects to distinguish between three strips that each had two light blobs and three strips that each had only one light blob. Finally, the faces task asked subjects to distinguish between stimuli produced from two faces.
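The trial-generation procedure described above can be sketched as target-plus-truncated-Gaussian pixel noise. In this sketch (Python/NumPy), resampling out-of-range pixels is one way to realize the truncation, and sigma is a hypothetical noise level; in the actual experiments it was set by stair-casing:

```python
import numpy as np

def make_stimulus(target, sigma, lo=0.0, hi=1.0, rng=None):
    """Sample one trial's stimulus: a chosen target plus i.i.d. Gaussian
    pixel noise, truncated so values stay within the display range [lo, hi].

    target : 1-D array of target pixel values (assumed to lie in [lo, hi]).
    sigma  : noise standard deviation (hypothetical; set by stair-casing).
    """
    if rng is None:
        rng = np.random.default_rng()
    noisy = target + rng.normal(0.0, sigma, size=target.shape)
    # Resample out-of-range pixels rather than clipping, so each pixel's
    # noise follows a truncated Gaussian distribution.
    bad = (noisy < lo) | (noisy > hi)
    while bad.any():
        noisy[bad] = target[bad] + rng.normal(0.0, sigma, size=bad.sum())
        bad = (noisy < lo) | (noisy > hi)
    return noisy
```

Resampling keeps the per-pixel distribution a genuine truncated Gaussian, whereas clipping would pile probability mass onto the range endpoints.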
The four-square data were originally collected by [7] and were also analyzed in [4]. The other data are newly gathered. Each data set consists of approximately 4000 trials from each subject. Subjects were given auditory feedback after each trial that indicated success or failure, to maintain their interest in the task across the thousands of trials.

Figure 2: Left: The most probable ω parameters found in the four-square task. Results include recovery from a simulated two-feature observer, a simulated four-feature observer, and four human subjects (white indicates very positive components, black indicates very negative components, neutral gray indicates zero). Right: The mutual information between the features and the classifications.

Fitting GRIFT models is not overly sensitive to the random initialization procedure used to start each EM instance. The λ parameters were initialized by random samples from a normal distribution, and then half were negated so the features would tend to start evenly assigned to the two classes. In the four-square experiments, the ω parameters were initialized by a mixture of Gaussians distribution, and in the light-dark experiments they were initialized from a uniform distribution. In the faces experiments, the ω were initialized by adding normal noise to the optimal linear classifier separating the two targets. Because of the large number of pixels in the faces stimuli, the other initialization procedures frequently produced initial assignments with extremely low probabilities, which led to numerical precision problems. In the four-square experiments, the θ were initialized randomly, but in the other experiments they were set to the optimal threshold for distinguishing the classes using the initial ω as a linear classifier.
Altering the initialization procedure did not appear to change results; it only affected the speed of EM and the number of restarts required. In the four-square experiment, the noise levels were continuously adjusted to keep the subjects' performance at approximately 71% using the stair-casing algorithm [9]. This performance level is high enough to keep the subjects engaged in the task, but allows for sufficient noise to explore their responses in a large volume of the state space. After an initial adaptation period, the noise level remains relatively constant across trials, so the inter-trial dependence introduced by the stair-casing can be safely ignored. Two simulated observers were created to validate GRIFT on the four-square task. Each one used a GRIFT model with pre-specified parameters to probabilistically classify four-square data at a fixed noise level, which was chosen to produce approximately 70% correct performance. One model used four features, one to detect each bright corner; the other used only two features, one sensitive to the top of the stimulus being brighter than the bottom, the other sensitive to the bottom of the stimulus being brighter than the top. The results of using GRIFT to recover the features are collected in Figure 2. Only the ω parameters are displayed because they are the most informative. Dark pixels indicate image regions that a feature is weighting negatively, and bright pixels correspond to positive weights. Clusters of similarly weighted pixels represent regional averages, and the presence of dark and light regions in a feature indicates the computation of contrasts between those areas. The sign of the weights is usually not significant -- given a fixed number of features, there are typically several sets of features that assign identical log likelihoods to the data and only differ from each other in the signs of their ω terms and the associated θ and λ values.
Because the optimal number of features for human subjects is unknown, GRIFT models with 1-4 features were fit to the data from each subject. One method of determining the correct number of features would be to hold out a test set or perform cross-validation. Unfortunately, simulation indicates that a suitable test set would require several thousand samples, and computational expense makes cross-validation extremely time-consuming with the current MATLAB implementation. Instead, after recovering the parameters, the mutual information between the hidden F variables and the observed classifications C was estimated. Mutual information measures how well the feature detector outputs can predict the subject's classification decision. Unlike the log likelihood of the observations, which is dependent on the choice to model C with a sigmoid function, mutual information does not assume a particular relationship between F and C. Plotting the mutual information as the number of features, N, is increased provides a sense of the relative utility of the features found for different values of N. Across all subjects, the best one-feature model was based on the contrast between the top of the image and the bottom. This is extremely similar to the result produced by a classification image of the data, reinforcing the strong similarity between one-feature GRIFT and that approach. Fitting GRIFT to the simulated observers demonstrated that if the model is accurate, the correct features can be recovered reliably. The 2sim observer showed no substantial increase in mutual information as the number of features increased from 1 to 4. Each set of recovered features included a top-bottom contrast feature and other noisy features that did not contribute much to predicting C.
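The mutual information measure used for model selection above can be illustrated with a simple plug-in estimate from paired samples. In the actual analysis F is hidden, so the estimate would be computed from posterior feature probabilities under the fitted model; this sketch (Python; names are ours) assumes hard binary feature samples for simplicity:

```python
from collections import Counter
import numpy as np

def mutual_information(feature_outputs, classifications):
    """Plug-in estimate of I(F; C) in bits from paired samples.

    feature_outputs : list of tuples of binary feature values, one per trial
                      (in practice, samples or expectations of the hidden F).
    classifications : list of observed class labels (1 or 2), one per trial.
    """
    n = len(classifications)
    joint = Counter(zip(feature_outputs, classifications))
    p_f = Counter(feature_outputs)
    p_c = Counter(classifications)
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        mi += p_fc * np.log2(p_fc / ((p_f[f] / n) * (p_c[c] / n)))
    return mi
```

Features that perfectly determine the response give one bit per binary distinction, while features independent of the response contribute nothing, which is why the measure flattens out once extra features stop predicting C.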
Although the observer truly used two features, a top-bright and a bottom-bright detector, the recovery of only one top-bottom contrast feature is a success because one contrast feature plus a suitable λ_0 term is logically equivalent to the original two-feature model. The 4sim observer showed a substantial increase in mutual information as N went from 1 to 4, and the ω values clearly indicate four corner-sensitive features. The 4sim data were also tested with a five-feature GRIFT model (not shown), which produced four corner features and one noisy, uninterpretable feature. Its gain in mutual information was smaller than that observed on any of the previous steps. Note that the corner areas in the features recovered from the 4sim data are sometimes black and sometimes white. Remember that these are not image pixel values that the features are attempting to match, but positive and negative weights. The important factor is that the corners have a large weight and the other pixels have a large weight of the opposite polarity, indicating a contrast feature that is sensitive to the presence or absence of relative brightness in that corner. Even though the targets consisted of four bright-corner stimuli, recovering the parameters from the 2sim observer never produced ω values indicating corner-specific features. One of the important advantages of GRIFT over previous methods such as [4] is that there is no problem with targets "contaminating" the features. The simulation demonstrates that the recovered features are those that explain the classification results and will not appear simply because of the structure of the targets and classes. The data of the four human subjects revealed some interesting differences. The largest disparity was between subject EA and subject JG. EA's data indicated no consistent pattern of mutual information increase after two features, and the two-feature model appears to contain two top-bottom contrast features.
Therefore, it is reasonable to conclude that EA was not explicitly detecting the corners. At the other extreme is subject JG, whose data show four very clear corner detectors and a steady increase in mutual information up to four features. Therefore, it seems very likely that this subject was matching corners and probably should be tested with a five-feature model to gain additional insight. AC's and RS's data suggest three corner features and a top-bottom contrast. In the light-dark and faces experiments, stair-casing was used to adjust the noise level to the 71% performance level at the beginning of each session, and then the noise level was fixed for the remaining trials to improve the independence of the samples. Subjects were promised a reward for achieving the highest score on the task. Two subjects were run in the light-dark case, S2 and S3. S2 achieved the expected performance level (73%), while S3's performance was near chance (55%). It appears that S3's poor performance was the result of too much success during the stair-casing period, which resulted in a very high noise level for the rest of the trials. This fortuitous circumstance provides an informative contrast to S2. The results of fitting the GRIFT model to the subjects' data are in Figure 3. Given the differences in performance, it is unsurprising that GRIFT recovered very distinct features for S2 but not for S3. The S2 model with the highest mutual information has four features, one

Figure 3: Top: The most probable ω parameters found for subject S1 in the faces experiment. Bottom: The most probable ω parameters found for subjects S2 and S3 in the light-dark experiment. White indicates very positive components, black indicates very negative components, neutral gray indicates zero.
Right: The mutual information between the features and classifications in the faces (top) and light-dark (bottom) experiments.

feature that is sensitive to the overall brightness and three that are sensitive to the particular positions of light or dark spots. Fitting a five-feature model resulted in no increase in mutual information and merely added a fifth feature that is always active and, therefore, useless. For subject S2, the mutual information declined slightly between the two- and three-feature models, and then increased substantially for four features. Consider the features in these three instances. The two-feature model contains an overall brightness feature (as does the one-feature model), but that feature disappears in the three-feature case and then reappears with four features. (In the figure, the ω values for each feature are scaled relative to the other features in the same model.) The four-feature model is the three-feature model with an overall brightness feature added to it. It appears that the features of the three-feature model are only effective in our model if they are all present -- they are jointly, but not singly, informative. Therefore, when only three features are available, they exclude the overall brightness feature, and the optimal assignment includes all four features. The S3 model does not show a substantial increase in mutual information as the number of features rises. This leads to the conclusion that the one-feature model is probably the best fit, and since performance was extremely low, it can be assumed that the subject was reduced to near-random guessing much of the time. The clear distinction between this case and the results from subject S2 demonstrates the effectiveness of GRIFT and the mutual information measure in distinguishing two different classification strategies and again shows how over-fitting can be avoided. The faces experiment presented the largest challenges.
The stimuli were two unfiltered faces from Gold et al.'s data set [6]. They were down-sampled to 128x128 for use as targets in our experiment. After the experiment, the stimuli were down-sampled further to 32x32, and the background surrounding the faces was removed by cropping, reducing the stimuli to 26x17. These steps were necessary to make the EM algorithm computationally feasible and to reduce the number of model parameters so they would be sufficiently constrained by 4000 samples. The face results for subject S1 are in Figure 3. The one-feature model recovered ω values that place weights across the entire face, especially near the nose, eyes, and other facial features. The two-feature model has only slightly better mutual information. It duplicates (with a sign change) the feature of the one-feature model and then adds a second, minimally informative feature, which is almost never active. The conclusion must be that the one-feature model fits best. Although finding a two-feature face model would have been interesting, this result reinforces previous work indicating that visual face processing has a holistic, configural aspect [12]. This may be especially true when detection of individual features is difficult, as was the case in this experiment. Across all of these experiments, the data collected were completely compatible with the original classification image method. In fact, the four-square human subject data were originally analyzed using that algorithm. One of the advantages of GRIFT is that it allows the reanalysis of old data to reveal new information. Even in the one-feature case, GRIFT allows the user to place prior probability functions on the parameters, which may enable good performance when data are too scarce for the classification image approach, while fitting multi-feature GRIFT models can reveal previously hidden non-linear classification strategies.
6 Conclusion

This paper has described the GRIFT model for determining the features used in human image classification. GRIFT is an advance over previous methods that assume a single linear classifier on pixels because it describes classification as the combination of multiple independently detected features. It provides a generative, probabilistic model of classification that can account for human classification data and incorporate prior knowledge and assumptions about these features. The features it produces are associated with the classification strategy employed by the subject and are not the result of structure in the targets or the classes. GRIFT's value has been demonstrated by modeling the performance of human subjects on the four-square and light-dark tasks. Its inability to find multiple features when analyzing human performance on the faces task fits with previous results. One of the strengths of the graphical model approach is that it allows easy replacement of model components. An expert can easily change the prior distributions on the parameters to reflect knowledge gained in previous experiments. For example, it might be desirable to encourage the formation of edge detectors. New resolution-independent feature parameterizations can be introduced, as can transformation parameters to make the features translationally and rotationally invariant. If the features have explicitly parameterized locations and orientations, the model could be extended to model their joint relative positions, which might provide more information about domains such as face classification. The success of this first version of GRIFT on human data provides a firm foundation for these future developments.

References

[1] A.J. Ahumada, Jr. Classification image weights and internal noise level estimation. Journal of Vision, 2(1), 2002.
[2] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] C.M. Bishop. Pattern Recognition and Machine Learning.
Springer, 2006.
[4] A.L. Cohen, R.M. Shiffrin, J.M. Gold, D.A. Ross, and M.G. Ross. Inducing features from visual noise. Journal of Vision, 7(8), 2007.
[5] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.
[6] J.M. Gold, P.J. Bennett, and A.B. Sekuler. Identification of band-pass filtered letters and faces by human and ideal observers. Vision Research, 39, 1999.
[7] J.M. Gold, A.L. Cohen, and R. Shiffrin. Visual noise reveals category representations. Psychonomic Bulletin & Review, 15(4), 2006.
[8] F. Gosselin and P.G. Schyns. Bubbles: a technique to reveal the use of information in recognition tasks. Vision Research, 41, 2001.
[9] N.A. Macmillan and C.D. Creelman. Detection Theory: A User's Guide. Lawrence Erlbaum Associates, 2005.
[10] S.E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press, 1999.
[11] D.G. Pelli, B. Farell, and D.C. Moore. The remarkable inefficiency of word recognition. Nature, 425, 2003.
[12] J. Sergent. An investigation into component and configural processes underlying face perception. British Journal of Psychology, 75, 1984.