Abstract
We propose a submodular reranking algorithm to boost image retrieval performance based on multiple ranked lists obtained from multiple modalities in an unsupervised manner. We formulate the reranking problem as maximizing a submodular and nondecreasing objective function that consists of an information gain term and a relative ranking consistency term. The information gain term exploits relationships of initially retrieved images based on a random walk model on a graph, then images similar to the query can be found through their neighboring images. The relative ranking consistency term takes relative relationships of initial ranks between retrieved images into account. It captures both images with similar ranks in the initial ranked lists, and images that are similar to the query but highly ranked by only a small number of modalities. Due to its diminishing returns property, the objective function can be efficiently optimized by a greedy algorithm. Experiments show that our submodular reranking algorithm is effective and efficient in reranking images initially retrieved by multiple modalities. Our submodular reranking framework can be easily generalized to any generic reranking problems for realtime search engines.
Approach
We present a submodular objective function for reranking images retrieved by multiple feature modalities, which is very efficient and fully unsupervised. It consists of two terms.
 An information gain term which models pairwise relationships of retrieved images represented by graphs.

Convert ranked lists to graphs
Denote \(\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2 \cup \cdots \cup \mathcal{V}_M\), we aim to select a subset of nodes \(\mathcal{S}\) from \(\mathcal{V}\) which are the most similar to the query image. Let \(\mathcal{U}\) denote the set of images which are not selected.

Information gain term for each graph is defined as
$$F_m (\mathcal{S}) = H(\mathcal{V}_m \backslash \mathcal{S})  H(\mathcal{V}_m \backslash \mathcal{S}\mathcal{S})$$
which can be represented by a random walk model. The final information gain term of all graphs is $$R(\mathcal{S}) = \sum_m (\sum_{v \in \mathcal{V}\backslash \mathcal{S}} p_m(v) \log p_m(v)  \sum_{v \in \mathcal{V} \backslash \mathcal{S}, s \in \mathcal{S}} p_m(v,s) \log p_m(vs))$$
which is a submodular and monotone function.

Convert ranked lists to graphs
 A relative ranking consistency term which exploits the interrelationships among multiple ranked lists obtained by different feature modalities.
 The relative ranking consistency (RRC) is motivated by two criterion: relationships of relative ranks between retrieved images should be maintained, and an image that is similar to the query but highly ranked by only a smaller number of modalities should also be captured.
We define the relative ranking between two images as \(rr_m (v_i,v_j) =  r_{m,v_i}  r_{m,v_j} , v_i, v_j \in \mathcal{V}\), where \(r_{m,v_i}\) is is the position of image \(I_i\) in the \(m\)th ranked list. Then the relative ranking consistency measure across multiple ranked lists is $$\mathcal{C} (v_i, v_j) = \frac{1}{Z} \sum_{m,m' \in M, m \neq m'} 1  \frac{\min (rr_m, rr_{m'})} {K}$$
where \(Z = \frac{M(M1)}{2}\) is a normalization factor corresponding to the number of all possible modality pairs. If two images have similar ranks, they have high RRC score and are similarly ranked after reranking. If an image is only highly ranked by a small number of modalities, it still has relatively high RRC score. The complete RRC term is $$T(\mathcal{S}) = (1q) \sum\nolimits_{s=1}^{\mathcal{S}} q^s \cdot \frac{1}{s}\sum\nolimits_{v_i, v_j \in \mathcal{S}, r_{v_i} < r_{v_j} = s} \mathcal{C} (v_i, v_j)$$
\(q\) is a predefined decay weight to give higher ranked images more importance. \(T(\mathcal{S})\) is also submodular and nondecreasing.
 The relative ranking consistency (RRC) is motivated by two criterion: relationships of relative ranks between retrieved images should be maintained, and an image that is similar to the query but highly ranked by only a smaller number of modalities should also be captured.

Combining the information gain and relative ranking consistency terms, we obtain the final objective function \(Q(\mathcal{S})= R(\mathcal{S}) + \lambda T(\mathcal{S})\) for the reranking problem. The solution is obtained by maximizing the objective function $$\max_{\mathcal{S}} \quad R(\mathcal{S}) + \lambda T(\mathcal{S}) \\
s.t. \quad \mathcal{S} \subseteq \mathcal{V}, \mathcal{S} \leq K_s$$
\(K_s\) is the largest number of selected images, which means we only select and rerank at most \(K_s\) images. It can be approximately optimized by a greedy algorithm.
Experimental Results
 Datasets and features
 Holidays: 1491 image from 500 categories, where the first image in each category is used as a query.
 Ukbench: 10200 images from 2550 objects or scenes.
 Oxford and Paris: 5062 and 6412 photos of famous landmarks in Oxford and Paris, respectively. Both datasets have 55 queries, where multiple queries are from the same landmark.
 BoW vectors from Hessian affine + SIFT descriptor using single assignment and approximate kmeans (AKM). Standard tfidf weighting is used.
 1192dimension GIST feature.
 4000dimension HSV color feature with 40 bins for H and 10 bins for S and V components.
 Results
Comparisons with stateoftheart approaches. We NS score (average top 4 hits) on UKbench and mAP (in %) on other datasets.
Comparisons with other rank aggregation approaches. Runtime (in second) of reranking 1000 images for a single query using direct greedy optimization and lazy evaluation is shown in the rightmost columns.
Code and Datasets
The MATLAB implementation (version 1.0) can be downloaded from here. Due to large size of the precomputed data (similarity matrices), only an example on Holidays dataset is included. If you would like to use the precomputed data of the other three datasets, please send me an email. I would be happy to share the data via some other ways.Publications

Submodular Reranking with Multiple Feature Modalities for Image Retrieval
Fan Yang, Zhuolin Jiang and Larry S. Davis
12th Asian Conference on Computer Vision (ACCV), Accepted, 2014. (Oral) [paper][supp. mat.]