An empirical study on image bag generators for multi-instance learning

Wei, **u-Shen; Zhou, Zhi-Hua

doi:10.1007/s10994-016-5560-1

An empirical study on image bag generators for multi-instance learning

Published: 22 March 2016

Volume 105, pages 155–198, (2016)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

An empirical study on image bag generators for multi-instance learning

Download PDF

**u-Shen Wei¹ &
Zhi-Hua Zhou¹

4289 Accesses
Explore all metrics

Abstract

Multi-instance learning (MIL) has been widely used on diverse applications involving complicated data objects such as images, where people use a bag generator to represent an original data object as a bag of instances, and then employ MIL algorithms. Many powerful MIL algorithms have been developed during the past decades, but the bag generators have rarely been studied although they affect the performance seriously. Considering that MIL has been found particularly useful in image tasks, in this paper, we empirically study the utility of nine state-of-the-art image bag generators in the literature, i.e., Row, SB, SBN, k-meansSeg, Blobworld, WavSeg, JSEG-bag, LBP and SIFT. From the 6923 (9 bag generators, 7 learning algorithms, 4 patch sizes and 43 data sets) configurations of experiments we make two significant new observations: (1) Bag generators with a dense sampling strategy perform better than those with other strategies; (2) The standard MIL assumption of learning algorithms is not suitable for image classification tasks.

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

Instance-level accuracy versus bag-level accuracy in multi-instance learning

Article 07 May 2015

Multiple Instance Classification in the Image Domain

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In investigating the problem of drug activity prediction, Dietterich et al. (1997) proposed the notion of multi-instance learning (MIL). Contrasting to traditional single-instance learning, the multi-instance representation enables the learning process to exploit some inherent structure information in input patterns. Specifically, MIL receives a set of bags that are labeled positive or negative, rather than receiving a set of instances which have positive or negative labels. In addition, instances in MIL bags have no label information. The goal of MIL is to learn a classifier from the training bags such that it can correctly predict the labels of unseen bags.

With the rapid development of MIL, it has been widely used on diverse applications, especially the tasks involving complicated data objects, such as image categorization (Chen et al. 2006; Song et al. 2013), image annotation (Tang et al. 2010; Zhu and Tan 2011), image retrieval (Zhang et al. 2002; Zhou et al. 2003; Xu and Shih 2012), medical image diagnosis (Fung et al. 2007), face detection (Zhang and Viola 2008), text categorization (Andrews et al. 2003) and so on. Particularly, when we tackle the image tasks, an image can be naturally partitioned into several semantic regions; each region is represented as a feature vector (an instance). Consequently, MIL solutions have been recognized as state-of-the-art image categorization/annotation methods, particularly for region-based image categorization/annotation (Chen and Wang 2004; Yang et al. 2006; Vijayanarasimhan and Grauman 2008).

In practice, users first use a bag generator to represent an original data object as a bag of instances, and then apply MIL algorithms. It is noteworthy that the bag generators are different from the feature extraction process; that is, a bag generator decides how an image will be represented by a set of patches, whereas a feature extraction process decides how each patch is characterized by a feature vector. As there are many different ways for representing one data object into multiple instances, bag generators are crucial to MIL learning performances. However, the evaluation of different bag generators has rarely been studied, although many effective MIL learning algorithms have been developed during the past decades (representative examples include EM-DD (Zhang and Goldman 2000), Citation-kNN (Wang and Zucker 2000), RIPPER-MI (Chevaleyre and Zucker 2001), miSVM (Andrews et al. 2003), MIBoosting (Xu and Frank 2004), miGraph (Zhou et al. 2009), MILES (Chen et al. 2006), MIForests (Leistner et al. 2010), etc.).

In this paper, we focus on image categorization tasks and empirically investigate the properties of nine popular image bag generators, i.e., Row, SB, SBN (Maron and Ratan 2001), Blobworld (Carson et al. 2002), k-meansSeg (Zhang et al. 2002), WavSeg (Zhang et al. 2004), JSEG-bag (Liu et al. 2008), LBP (Ojala et al. 2002) and SIFT (Lowe 2004). Note that here we are studying what kind of bag generators are suitable to MIL algorithms, rather than studying general image classification approaches or novel image feature representations. Readers interested in those topics can refer to Felzenszwalb et al. (2010), Chatfield et al. (2011) and Nguyen et al. (2009).

Given an image for bag generators, they first separate the image into a number of regions, and then represent a region as an instance. Thus, by setting different patch sizes, a bag generator can obtain multiple bags with different number of instances for the same image. To examine the impact of bag generators on the classification performances with different learning algorithms, we employ seven state-of-the-art MIL methods as the test bed, including Citation-kNN (Wang and Zucker 2000), miSVM (Andrews et al. 2003), MIBoosting (Xu and Frank 2004), miGraph (Zhou et al. 2009), MILES (Chen et al. 2006), miFV (Wei et al. 2014) and MIForests (Leistner et al. 2010). Forty-three data sets with diverse target concepts are created from the COREL and MSRA image data sources. In all, by combining nine bag generators (five of them with different patch sizes, i.e., Row, SB, SBN, k-meansSeg and JSEG-bag), seven learning algorithms and forty-three data sets, we set up an extensive empirical study with 6923 configurations. From the experimental results, we have some important observations. Specifically, some bag generators (i.e., SB, SBN and LBP) with the dense sampling strategy will outperform other generators in most cases, which is consistent with the conclusion in computer vision (Li and Perona 2005; Nowak et al. 2006); miGraph, MIBoosting and miFV stress the relationship between instances in MIL bags, which do not adopt the standard MIL assumption (i.e., the bags are labeled positive in the way that if a bag contains at least one positive instance, otherwise it is labeled as negative), thus these learning algorithms could achieve better classification accuracy rates (cf. Table 5). Note that these two important observations have not been made before. Moreover, we analyze the utilities of these bag generators for different kinds of image classification tasks, i.e., scene classification and object classification. In addition, we also have some interesting findings about learning algorithms, and recommend several combinations of learning algorithm and bag generator for practical applications. In short, these observations, on one hand, give practical suggestions for bag generator selections with diverse needs, and on the other hand, they are insightful on designing better bag generators or MIL algorithms for image related tasks.

The rest of this paper is organized as follows. In Sect. 2, we briefly introduce multi-instance learning and the related works. In Sect. 3, we introduce some image bag generators, especially the ones which will be empirically studied later in this paper. We then give information about our empirical configurations in Sect. 4, including the learning algorithms, data sets and evaluation criteria used in our study. In Sect. 5, we present our empirical results. Finally, we summarize our main observations in Sect. 6 and conclude the paper.

2 Background

In the middle of the 1990s, Dietterich et al. (1997) investigated the problem of drug activity prediction. The goal was to use a model to predict whether a new molecule can be qualified to make some drug, through analyzing a set of known molecules. The difficulty of this problem was that, each molecule may have a wide range of possible types of low-energy shapes, but biochemistry experts at that time only knew which molecules were qualified to make drug, instead of knowing which special shapes played a decisive role.

In order to solve this problem, Dietterich et al. regarded each molecule as a bag, and regarded each kind of low-energy shapes of one molecule as an instance in its corresponding bag, thereby formulating multi-instance learning.

Formally, let $\mathcal {X}$ denote the instance space and $\mathcal {Y}$ the set of class labels. The task of multi-instance learning is to learn a function $f : 2^{\mathcal {X}} \rightarrow \{-1, +1\}$ from a given data set $\{(\mathbf {X}_1,y_1),(\mathbf {X}_2,y_2),\ldots ,(\mathbf {X}_m,y_m)\}$, where $\mathbf {X}_i \subseteq \mathcal {X}$ is a set of instances $\{\mathbf {x}_1^{(i)},\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_{n_i}^{(i)}\}$, $\mathbf {x}_{j}^{(i)} \in \mathcal {X} (j\in \{1,\ldots ,n_i\})$, and $y_i \in \{-1, +1\}$ is the known label of $X_i$. In contrast, the task of traditional supervised learning is to learn a function $f : \mathcal {X} \rightarrow \mathcal {Y}$ from a given data set $\{(\mathbf {x}_1,y_1),(\mathbf {x}_2,y_2),\ldots ,(\mathbf {x}_m,y_m)\}$, where $\mathbf {x}_i \in \mathcal {X}$ is an instance and $y_i \in \mathcal {Y}$ is the known label of $\mathbf {x}_i$.

Far beyond drug activity prediction, the multi-instance problem emerges naturally in a variety of challenging learning problems in image related tasks, including natural scene classification (Maron and Ratan 2001), image categorization/classification (Chen et al. 2006; Song et al. 2013), image annotation (Tang et al. 2010; Zhu and Tan 2011) and image retrieval (Zhang et al. 2002; Zhou et al. 2003; Xu and Shih 2012). In addition, MIL techniques have already been used on diverse applications, for example face detection (Zhang and Viola 2008; Viola et al. 2006), text categorization (Andrews et al. 2003; Settles et al. 2008), web mining (Zhou et al. 2005) and so on.

In the applications of multi-instance learning, Maron and Ratan (2001) were the first to apply the MIL framework to image tasks. In their work, several bag generators (i.e., Row, SB, SBN) for transforming images into image bags were presented and tested, and more importantly, they demonstrated that bag generators play a key role in develo** a practical CBIR system based on multi-instance learning and significantly affect the performance of the retrieval.

After that, several image bag generators have been proposed during the past decade. However, very little work has been done on the evaluation of image bag generators. Even worse, it is usually the case that researchers used a new bag generator without comparing to existing bag generators (e.g., Carson et al. (2002) and Zhang et al. (2004)).

Zhou et al. (2003) compared the performances of several different bag generators when they proposed the ImaBag image bag generator. In their experiments, they compared ImaBag with Maron and Ratan’s SBN (Maron and Ratan 2001) and Yang and Lozano-Pérez’s bag generator (Yang and Lozano-Pérez 2000), yet merely by employing the Diverse Density (Maron and Lozano-Pérez 1998) algorithm. Later they showed that the performances of ImaBag were worse than that of SBN, but were much better than that of Yang and Lozano-Pérez’s method. Additionally, Zhang et al. (2002) studied the performances of EM-DD (Zhang and Goldman 2000) across different image processing techniques based on SBN (Maron and Ratan 2001) and their k-meansSeg bag generator. However, they only reported k-meansSeg outperformed SBN in some cases. Compared to the previous work, we perform a very large number of experiments to study the utilities of nine state-of-the-art image bag generators and present an exhaustive evaluation of these image bag generators.

3 Image bag generators

Image bag generators extract information from an original image and then construct a set of instances which is regarded as an MIL bag. Depending on whether bag generators can distinguish the semantic components from images, image bag generators can be divided into two categories, i.e., non-segmentation bag generators and segmentation bag generators. Non-segmentation bag generators adopt a fixed strategy which is independent of image structures to extract instances from images. While segmentation bag generators try to segment an image into multiple semantic components, and construct MIL bags by using one instance to represent one corresponding semantic component.

In this paper, seven staple image bag generators are studied, including the simple method Row (Maron and Ratan 2001), the original image non-segmentation based methods SB and SBN (Maron and Ratan 2001), and the transformation based methods Blobworld (Carson et al. 2002), k-meansSeg (Zhang et al. 2002), WavSeg (Zhang et al. 2004) and JSEG-bag (Liu et al. 2008). Among which, Row, SB and SBN are non-segmentation bag generators, and Blobworld, k-meansSeg, WavSeg and JSEG-bag are segmentation bag generators. In addition, some local descriptors^{Footnote 1} in computer vision have been frequently applied to generate bags for MIL in recent years. Therefore, we employ two famous local descriptors, i.e., local binary patterns (LBP) (Ojala et al. 2002) and scale invariant feature transform (SIFT) (Lowe 2004), as bag generators to extract sets of features from images. Detailed descriptions of these methods will be shown in the next subsections, followed by a brief introduction to some other bag generators and finally an abbreviated conclusion of these image bag generators.

3.1 Row, SB and SBN

Maron and Ratan (2001) proposed five bag generators with the same preprocessing steps. The most popular three among them are Row, SB and SBN which all work in the RGB color space.

Row. For an $8\times 8$ filtered image, as Fig. 1a demonstrates, the bag is constructed as following: for each row, one instance is constructed by the mean color of this row and the mean color difference in the rows above and below it.

The SB bag generator is short for Single Blob with no neighbors. As shown in Fig. 1b, each instance is a $2\times 2$ sized blob of the original image. Note that there is no overlap** between pairwise blobs.

The SBN bag generator is short for Single Blob with Neighbors, which takes the relationship between neighboring blobs into account. Each instance is constructed as the mean color value of a $2\times 2$ blob and the color difference with its four neighboring blobs. See Fig. 1c.

Note that the key difference between SBN and SB lies in that SBN has overlap** blobs. Figure 2 is an example of the bag generating process of SBN for an $8\times 8$ image. Here each blob is a $2\times 2$ image patch and the sliding window is cross-shaped as presented in Fig. 2a. During the bag generation process, when the sliding window moves one pixel from left to right, SBN abstracts one instance from the image in the current position at that time. Thus, as shown from Fig. 2a–c, the abstracted instances overlap with each other.

Additionally, there also exists another difference between SB and SBN. SB traverses the whole image with no blind zones, while SBN produces blind zones on the corners when sliding window moves to the edge of the image, which could contribute nothing for image representation. In Fig. 2c, the northeast and northwest blobs highlighted with red rectangles are two blind zones.

3.2 Blobworld

The process of Blobworld (Carson et al. 2002) is as follows. First, Blobworld extracts each image pixel’s color feature which has a three-dimensional color descriptor in the L*a*b* color space.^{Footnote 2} Second, it extracts texture features from the grayscale images, which are to get the anisotropy, the contrast and the polarity. So far the color/texture descriptor for a given pixel consists of six values: Three for color and three for texture. In the third step, we append the (x, y) position of the pixel to the previous feature vector. After obtaining the pixel features of 8-dimension, Blobworld groups pixels into regions by modeling the distribution of pixel features with a mixture of Gaussians. In order to divide these pixels into groups, it uses the Expectation-Maximization (EM) algorithm to estimate the maximum likelihood parameters of this mixture of K Gaussian components. Finally, Blobworld describes the color distribution and texture of each region for the MIL algorithms, i.e., the representation of each region in an image is one instance in a bag. The stages of Blobworld processing are depicted in Fig. 3.

3.3 k-meansSeg

Zhang et al. (2002) proposed the k-meansSeg bag generator method when they studied content-based image retrieval. In k-meansSeg, images are executed in the YCbCr color space^{Footnote 3} without any preprocessing. It defines a $4\times 4$ image patch as a blob, represented by a six-dimensional vector. The first three dimensions are the mean values of three color components of these 16 $(4\times 4)$ pixels, and the latter three dimensions are composed by three sub-bands, i.e., HL, LH and HH, which are obtained by the Daubechies-4 wavelet transformation on the luminance (Y) component. Thus, the blobs of an original image can be expressed as:

$$\begin{aligned} blobs=\left\{ \left\langle Y_i,Cb_i,Cr_i,HL(Y)_i,LH(Y)_i,HH(Y)_i \right\rangle \mid i=1,2,\ldots ,n \right\} \end{aligned}$$

where n is the number of $4\times 4$ blobs.

After that, the k-means segmentation algorithm is employed on these six-dimensional vectors to segment the image into K segments, and thus one segment is corresponding to one instance. The unknown parameter K is set as 2 at the beginning of this method, then increases by cycling, and terminates until its stop conditions (Zhang et al. 2002). Finally, the ith instance in bags is obtained by averaging all the six-dimensional vectors representing all blobs in the ith segment:

$$\begin{aligned} bag&=\left\{ \langle mean(Y_{ij}),mean(Cb_{ij}),mean(Cr_{ij}),\right. \\&\left. mean(HL(Y)_{ij}),mean(LH(Y)_{ij}),mean(HH(Y)_{ij}) \rangle \mid i=1,2,\ldots ,K \right\} \end{aligned}$$

where K is the number of segments and j’s are all blobs of the ith segment.

3.4 WavSeg

Zhang et al. (2004) proposed the WavSeg bag generator to automatically construct multiple instances (regions) in MIL bags (images). WavSeg mainly involves the wavelet analysis and the Simultaneous Partition and Class Parameter Estimation (SPCPE) algorithm (Chen et al. 2000). In the first step, the images are preprocessed by the Daubechies-1 wavelet transform. After the wavelet transformation, the high-frequency components will disappear in larger scale subbands and therefore, possible regions will be clearly evident. Then by grou** the salient points from each channel, an initial coarse partition can be obtained and passed as the input to the SPCPE segmentation algorithm. In Zhang et al. (2004), they showed that the wavelet transform could lead to better segmentation results, and additionally, it can produce other useful features such as texture features. In the following, WavSeg extracts both the local color and local texture features for each image region. When extracting the color features, they quantize the color space by using color categorization based on HSV value ranges (totally 13 representative colors for these ranges) of the HSV color space.^{Footnote 4} For the regions’ texture features, the Daubechies-1 transform could generate three corresponding images in three frequency bands (i.e., HL, LH and HH) of the original image. For the wavelet coefficients in each of the above three bands, the mean and variance values are collected respectively. Therefore, the total six texture features are generated for each image region. The form of the bag generated by WavSeg is as follows:

$$\begin{aligned} bag= \{&\langle hist_1,hist_2,...,hist_{13},\\&mean(HL_1),var(HL_1),mean(LH_1),var(LH_1),\\&mean(HH_1),var(HH_1),mean(HL_2),var(HL_2),\\&mean(LH_2),var(LH_2),mean(HH_2),var(HH_2) \rangle \} \end{aligned}$$

3.5 JSEG-bag

Liu et al. (2008) proposed two bag generators. One of them is named JSEG-bag, which is based on the JSEG image segmentation algorithm (Deng and Manjunath 2001). The other one is named Attention-bag and based on the salient point-based technique. However, in their experiments, the results of Attention-bag are worse than those of SBN, so we only consider JSEG-bag.

Because JSEG-bag is based on the JSEG algorithm, we first introduce that algorithm. Deng and Manjunath (2001) presented the JSEG image segmentation algorithm for unsupervised segmentation of color-texture regions in images and videos. The method consists of two independent steps: color quantization and spatial segmentation. In the first step, colors in an image are quantized into several representative classes that can be used to differentiate regions in the image. This quantization is performed in the color space alone without considering the spatial distributions. Afterwards, image pixel colors are replaced by their corresponding color class labels, thus forming a class-map of the image. The second step is the spatial segmentation on the class-map of the image.

In JSEG-bag, it firstly segments an image with the JSEG algorithm (Deng and Manjunath 2001). Then it selects the top k regions from the segmented image in order of decreasing regions’ areas. Note that in our experiments, we vary the different values of k as 2, 6, and 10. In the third step of JSEG-bag, it computes the R, G and B color mean values of each region. Eventually, the image is converted into a corresponding image bag consisting of k 3-dimensional feature vectors (instances). The segmented result is shown in Fig. 6g.

3.6 Local binary patterns

Local binary pattern (LBP) (Ojala et al. 2002) is a local descriptor that captures the appearance of an image in a small neighborhood around a pixel. An LBP is a string of bits, with one bit for each of the pixels in the neighborhood. Each bit is turned on or off depending on whether the intensity of the corresponding pixel is greater than the intensity of the central pixel. Usually, these binary strings are pooled in local histograms, rather than directly using the binary strings.

The LBP in our experiments is from the open source library VLFeat.^{Footnote 5} In VLFeat, it implements only the case of $3\times 3$ pixel neighborhoods which is found to be optimal in several applications. In particular, as shown in Fig. 4, the LBP centered on pixel (x, y) is a string of eight bits. Each bit is equal to one if the corresponding pixel is brighter than the central one. Pixels are scanned starting from the one to the right in anti-clockwise order. For a $3\times 3$ neighborhood, an LBP is a string of eight bits and so there are 256 possible LBPs. In practice, the 256 patterns are further quantized into a 58 quantized patterns according to the uniform patterns (Heikkilä and Pietikäinen 2006). The quantized LBP patterns are further grouped into local histograms. In our experiments, we divide an image with $40\times 40$ pixel windows. Then the quantized LBPs in each window are aggregated into a histogram by using bilinear interpolation along the two spatial dimensions. Thus, the bag generated by LBP from a $240\times 360$ image has totally 54 ($(240/40)\times (360/40) = 6\times 9$) instances with 58 dimensions.

3.7 Scale invariant feature transform

A scale invariant feature transform (SIFT) feature (Lowe 2004) is a 3D spatial histogram of the image gradients in characterizing the appearance of an image keypoint. The first thing of computing SIFT descriptors is to extract SIFT keypoints, for whose details please refer to Lowe (2004). After collecting N SIFT keypoints, as shown in Fig. 5, for each SIFT keypoint, we compute the gradient magnitude and orientation at each image sample point in an image patch. These samples are weighed by the gradient norm and accumulated in a 3D histogram h, which forms the SIFT descriptor of the image patch. An additional Gaussian weighting function is applied to give less importance to gradients farther away from the keypoint center. Orientations are quantized into 8 bins and the spatial coordinates into four each. Therefore, the resulting SIFT descriptor is of dimension 128 ($8\text { bins} \times 4\times 4 = 128\text { bins}$). Note that, Fig. 5 just shows a $2\times 2$ descriptor array computed from an $8\times 8$ set of samples. In consequence, the bag generated by SIFT contains N instances of 128 dimensions.

3.8 Other image bag generators

Yang and Lozano-Pérez (2000) developed a bag generator called PRegions which is based on possible regions of images. This bag generator sets a list of possible regions of interest (ROI) in advance. After that, an image is divided into such overlap** regions. Each region is filtered and converted into a feature vector. In this way, the image is represented by a set of feature vectors. However, PRegions is very time-consuming and its performance is mediocre (Zhou et al. 2003).

In Zhou et al. (2003), a bag generator named ImaBag was presented. In the first step of this method, image pixels are clustered based on their colored and spatial features, where the clustering process is accomplished by a SOM neural network. Then, the clustered blocks are transformed into a specific number of regions by eliminating isolated pixels and merging scattered blocks. Finally, the resulting regions are converted into three-dimensional numerical instances of the image bag formed by their mean R, G, B values. Note that performance of the two bag generators is much worse than that of SBN, as reported in Zhou et al. (2003), and this is the reason why we evaluate the other seven bag generators without these two.

3.9 Recap of bag generators

We have described specific techniques and computations performed by the nine state-of-the-art bag generators. In this section, we will provide a brief comparison and some conclusions about them.

As mentioned earlier, Row, SB and SBN are three non-segmentation bag generators only extracting color features. They segment the original images into multiple regions by using fixed strategies, which might divide objects into several parts. And it might be disadvantageous for SBN: its overlap** strategy could make one object in an original image (bags) be presented in multiple regions (instances) many times, which appears to be problematic based on the results shown in Sect. 5.2.1.

Blobworld, k-meansSeg, WavSeg and JSEG-bag are segmentation bag generators. They are similar in that they first segment original images into multiple regions (instances), and then extract features for presenting each local region. The different points among them are their different segmentation approaches. Blobworld and k-meansSeg extract the pixel-level or blob-level features firstly. After that, they cluster these pixels or blobs into several regions (instances), i.e., Gaussian Mixture Model of Blobworld and k-means of k-meansSeg. Finally, for each region, they compute the average value of pixels’ or blobs’ features in the same one region as the regions’ features. WavSeg and JSEG-bag employ the SPCPE and JSEG segmentation algorithms, respectively, to segment original images. The final step of these two is extracting features from the multiple regions. In short, k-meansSeg and WavSeg contain both color and texture information of each region, and apart from this, Blobworld also contains the spatial information. However, JSEG-bag merely has color information. The segmentation results of different segmentation bag generators are shown in Fig. 6.

LBP and SIFT are two famous local descriptors employed as bag generators in this paper. They both compute the histogram-based features of local regions (instances) in images (bags). An important thing is that they both process the grayscale images, therefore their local features (i.e., the bits strings of LBP and the gradient distributions of SIFT) only contain texture information, without any color information.

In addition, from the view of sampling strategies, it is obvious to find that SB, SBN and LBP sample dense patches/regions to construct instances in bags. However, the SIFT descriptor (instance) is just based on the keypoints detected by SIFT detectors, rather than sampling dense local regions from original images. Moreover, the other bag generators only treat image segments as instances.

4 Empirical configurations

In this section, we first introduce seven state-of-the-art multi-instance learning algorithms used in our experiments. Then we describe the data sets and the evaluation criteria used in our empirical study.

4.1 Learning algorithms

Since the investigation of the drug activity prediction problem, many MIL algorithms have been proposed. According to a recent MIL review (Amores 2013), MIL algorithms are grouped into three paradigms: (a) the Instance-Space (IS) paradigm; (b) the Bag-Space (BS) paradigm; (c) the Embedded-Space (ES) paradigm, based on how they manage the information from the multi-instance data. In short, for the IS paradigm, the discriminative information is considered to lie at the instance-level, while in the BS paradigm, the discriminative information is at the bag level. The MIL algorithms in ES explicitly or implicitly map each MIL bag to a single feature vector which summarizes the relevant information about the whole bag.

Therefore, we select the corresponding presentative learning algorithms from each paradigm, respectively, which are: Citation-kNN (Wang and Zucker 2000) and miGraph (Zhou et al. 2009) for the BS paradigm; MIBoosting (Xu and Frank 2004), miSVM (Andrews et al. 2003) and MIForests (Leistner et al. 2010) for the IS paradigm; MILES (Chen et al. 2006) and miFV (Wei et al. 2014) for the ES paradigm.

In addition, the assumptions in these MIL algorithms can be divided into two groups, i.e., the standard MIL assumption and the relaxed MIL assumptions. The standard MIL assumption is that a bag is positive if and only if one or more of its instances are positive, otherwise it is labeled negative. In this paper, Citation-kNN, miSVM, MILES and MIForests obey the standard assumption. The relaxed assumptions stress the relationship between instances in a bag in determining the bag’s label, rather than one key instance can determine the bag’s label assumed in the standard assumptions. For example, miGraph treated instances in each MIL bag as non-i.i.d. samples (Zhou et al. 2009); MIBoosting assumed that all instances contributed equally and independently to a bag’s class label (Xu and Frank 2004); and miFV grouped instances in bags and encoded them into a new feature representation with the bag-level discriminative information, which implicitly assumed the instances in bag are non-i.i.d. (Wei et al. 2014). In the following, we try to give the key points about these MIL algorithms in this section.

Citation-kNN was proposed by Wang and Zucker (2000). The minimum Hausdorff distance was used as the bag-level distance metric, which is why we consider it to be a BS paradigm learning algorithm. In addition, when predicting the label of a new bag, Citation-kNN considers not only the bags that are the nearest neighbors of the new bag, but also the bags that count the new bag as their neighbor.

miGraph. Most previous MIL algorithms treat instances in the bags as independently and identically distributed. Considering that the instances in a bag are rarely independent in real tasks, Zhou et al. (2009) proposed miGraph to solve MIL problems by treating instances as non-i.i.d. samples. Their basic idea is to regard a bag as an entity to be processed as a whole (which demonstrates that it is a BS paradigm algorithm), and instances as inter-correlated components of the entity. miGraph implicitly constructs graphs by deriving affinity matrices and defines an efficient graph kernel considering the clique information.

MIBoosting. In contrast to the standard MIL assumption that there exists one or several “key” instances triggering the bag labels, MIBoosting (Xu and Frank 2004) assumes that all instances contribute equally and independently to a bag’s label. Naturally, the process of predicting the label of a bag is conducted in two stages. In the first stage, MIBoosting determines each individual instance’s class probabilities in a bag. And then, it combines these estimates to assign a class label to the bag in the second stage, which shows that it is an IS paradigm algorithm.

miSVM (Andrews et al. 2003) was designed for the instance-level classification problem. miSVM explicitly treats instance-level labels as unobserved integer variables, subjected to constraints defined by their bag-level labels. Intuitively, miSVM tries to look for an MI-separating linear discriminant such that at least one instance from every positive bag locates in the positive half-space, while all instances from negative bags locate in the negative half-space. Obviously, miSVM belongs to the IS paradigm.

MILES (Chen et al. 2006) converts MIL problems to standard supervised learning by embedding bags into an instance-based feature space (implicitly map**) and selecting the most important features. They define a similarity measure between a bag and an instance. The coordinates of a given bag in the feature space represent the bag’s similarities to various instances in the training set. At last, the 1-norm SVM is used to construct classifiers and select important features simultaneously.

miFV (Wei et al. 2014) is one kind of ES method with a vocabulary, and it is an efficient and scalable MIL algorithm. In miFV, the instances in MIL bags are first clustered into several “groups”, and then mapped by its map** function (explicitly map**) into a new feature vector representation (i.e., Fisher Vector (Sánchez et al. 2013)) with the bag-level label, which implicitly assumes the instances are non-i.i.d. samples. Note that miFV encodes instances in bags into a bag-level feature vector, rather than embedding bags into an instance-based feature space which is done in MILES.

MIForests (Leistner et al. 2010) brings the advantage of random forests to multi-instance learning. MIForests treats the (hidden) labels of instances as random variables defined over a space of probability distributions, which is obviously an IS paradigm algorithm. Thus, they formulate multi-instance learning as an optimization procedure where the labels of the instances become the optimization variables. After that, they disambiguate the instance labels by iteratively searching for distributions that minimize the overall learning objective.

Note that, in our experiments, we selected the corresponding optimal parameters of these learning algorithms via cross validations on their training data. The specific information of parameters can be found in Section II of the Appendix.

4.2 Data sets

The data sets used in our experiments are taken from COREL and MSRA image data sources which are very representative and frequently used in many image tasks of MIL researches.

COREL images consist of 2000 images from 20 CD-ROMs published by the COREL Corporation, and thus contain 20 categories where each category contains 100 images.^{Footnote 6} Images are in the JPEG format of $384\times 256$ or $256\times 384$ image resolution. The category names are listed in Table 1 along with the identifiers for these 20 categories. Figure 7 shows some sample images from COREL images.

MSRA images are the second version of MSRA-MM data set (Li et al. 2009). The image data set was collected from Microsoft Live Search and it contains about 1 million images manually classified into 8 categories, i.e., Animal, Cartoon, Event, Object, Scene, PeopleRelated, NamedPerson, and Misc. We here select 10 sub-categories from them and each sub-category has 500 images. Note that instead of the standard original image resolution in COREL images, these images from MSRA images are in different image resolutions. The sub-category names of MSRA images are listed in Table 2. Some sample images from MSRA images are shown in Fig. 8.

Table 1 COREL images

An empirical study on image bag generators for multi-instance learning

Abstract

Similar content being viewed by others

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

Instance-level accuracy versus bag-level accuracy in multi-instance learning

Multiple Instance Classification in the Image Domain

1 Introduction

2 Background

3 Image bag generators

3.1 Row, SB and SBN

3.2 Blobworld

3.3 k-meansSeg

3.4 WavSeg

3.5 JSEG-bag

3.6 Local binary patterns

3.7 Scale invariant feature transform

3.8 Other image bag generators

3.9 Recap of bag generators

4 Empirical configurations

4.1 Learning algorithms

4.2 Data sets

4.3 Evaluation criteria

4.3.1 Accuracy

4.3.2 The Friedman test

5 Empirical results

5.1 Accuracy comparison

5.1.1 From the bag generator view

5.1.2 From the learning algorithm view

5.1.3 Recommended combinations

5.2 Method observations

5.2.1 Similar performance phenomenon of SB and SBN

5.2.2 Observations about patch size

6 Summary of findings

7 Conclusions

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: Detailed experimental results

Appendix: Detailed experimental results

1.1 Section I: The detailed averagy accuracy rates of each combination (bag generator + learning algorithm) on Elephant, Fox, Tiger, 1000-Image, 2000-Image and MSRA

1.2 Section II: The corresponding optimal parameters for each combination on the 2000-Image and MSRA data collections

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation