
1 Introduction

In the field of computer vision, food recognition has attracted considerable interest from researchers, given its applicability in solutions that improve people's nutrition and, hence, their lifestyle [1]. With respect to healthy diets, traditional strategies for analyzing food consumption are based on self-reporting and manual quantification [2]. Hence, the resulting information tends to be inaccurate and incomplete [3]. An automatic monitoring system able to track food consumption is of vital importance, especially for the treatment of individuals who have eating disorders, want to improve their diet or want to reduce their weight.

Food recognition is a key element of a food consumption monitoring system. Originally, it was addressed with traditional approaches [4, 5], which extracted ad-hoc image features by means of algorithms based mainly on color, texture and shape. More recently, other approaches have focused on Deep Learning techniques [5,6,7,8]. In these works, the feature extraction algorithms are not hand-crafted and, additionally, the models automatically learn the best way to discriminate the different classes to be classified. As for the results obtained, there is a large gap (more than 30%) between the best method based on hand-crafted features and the newer methods based on Deep Learning, where the best results have been obtained with Convolutional Neural Network (CNN) architectures that use inception modules [8] or residual networks [7].

Fig. 1. Example images of the Food-101 dataset. Each image represents a dish class.

Food recognition can be considered a special case of object recognition, which has lately been a very active topic in computer vision. Its specificity is that dish classes present much higher inter-class similarity and intra-class variation than the usual ImageNet objects (cars, animals, rigid objects, etc.) (see Fig. 1). If we analyze the latest accuracy increases in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9], performance has improved thanks to the increased depth of CNN models [10,11,12,13] and also to the fusion of CNN models [11, 13]. The main drawbacks of CNNs are the need for large datasets to avoid overfitting the network and the need for high computational power to train them.

Considering the use of different classifiers trained on the same data, one can observe that the patterns misclassified by the different models do not necessarily overlap [14]. This suggests that they could potentially offer complementary information that can be used to improve the final performance [14]. An option for combining the outputs of different classifiers was proposed in [15], where the authors used what they call a decision templates scheme instead of simple aggregation operators such as the product or the average. As they showed, this scheme maintains good performance across different training set sizes and is also less sensitive to particular datasets than the other schemes.

In this article, we integrate the fusion concept into the CNN framework, with the purpose of demonstrating that combining the classifiers' outputs by means of a decision template scheme improves performance on the food recognition problem. Our contributions are the following: (1) we propose the first food recognition algorithm that fuses the output of different CNN models, (2) we show that our CNNs Fusion approach performs better than the CNN models used separately, and (3) we demonstrate that our CNNs Fusion approach maintains high performance independently of the target (dishes, families of dishes) and dataset, validating it on two public datasets.

The organization of the article is as follows. In Sect. 2, we present the CNNs Fusion methodology. In Sect. 3, we present the datasets and the experimental setup, and discuss the results. Finally, in Sect. 4, we describe the conclusions.

Fig. 2. General scheme of our CNNs Fusion approach.

2 Methodology

In this section, we describe the CNNs Fusion methodology (see Fig. 2), which is composed of two main steps: training K CNN models based on different architectures and fusing the CNN outputs using the decision templates scheme.

2.1 Training of CNN Models

The first step of our methodology involves separately training two CNN models. We chose two different kinds of models, winners of the ILSVRC object recognition task: both won, or are based on the winners of, the 2014 and 2015 editions of the challenge, which proposed novel architectures. The first bases its design on "inception modules" and the second on "residual networks". First, each model was pre-trained on the ILSVRC data. Later, all layers were fine-tuned for a certain number of epochs, selecting for each architecture the model that provided the best results on the validation set; these models are then used in the fusion step.
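As a concrete illustration of this step, below is a minimal fine-tuning sketch assuming a PyTorch/torchvision setup; the framework, the specific model variants (InceptionV3 and ResNet50 shown here), the hyperparameters and the helper names are our assumptions, not specified in the paper.

```python
# Hypothetical sketch of Sect. 2.1: fine-tune two ImageNet-pre-trained CNNs
# (an inception-based model and a residual model) and keep, for each one, the
# checkpoint with the best validation accuracy. Framework and hyperparameters
# are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from torchvision import models

def build_models(num_classes):
    # Inception-based model and residual model, both pre-trained on ILSVRC data
    inception = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    inception.fc = nn.Linear(inception.fc.in_features, num_classes)
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
    return inception, resnet

def finetune(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            if isinstance(outputs, tuple):   # InceptionV3 also returns aux logits in train mode
                outputs = outputs[0]
            criterion(outputs, labels).backward()
            optimizer.step()
        # keep the checkpoint that performs best on the validation set
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        acc = correct / total
        if acc > best_acc:
            best_acc = acc
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model, best_acc
```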

2.2 Decision Templates for Classifiers Fusion

Once we trained the models on the food dataset, we combined the softmax classifier outputs of each model using the Decision Template (DT) scheme [15].

Let us denote the output of the last layer of the k-th CNN model as \((\omega _{k,1},\ldots ,\omega _{k,C})\), where \(c=1,\ldots ,C\) indexes the classes (C being their number) and \(k=1,\ldots ,K\) indexes the CNN models (in our case, \(K=2\)). Usually, the softmax function is applied to obtain the probability with which model k classifies image x into class c: \( p_{k,c}(x) = \frac{e^{\omega _{k,c}}}{\sum _{c'=1}^{C} {e}^{\omega _{k,c'}}}. \) Let us consider the k-th decision vector \(D_{k}\):

$$ D_{k}(x) = [p_{k,1}(x), p_{k,2}(x),\ldots , p_{k,C}(x)] $$
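A minimal NumPy sketch of this decision vector follows; the function name and input format are illustrative assumptions, and the raw outputs \(\omega _{k,c}\) are assumed to be available as an array.

```python
# Sketch: decision vector D_k(x) = softmax of the raw last-layer outputs of model k.
import numpy as np

def decision_vector(omega_k):
    """omega_k: array of shape (C,) with the last-layer outputs of the k-th CNN."""
    e = np.exp(omega_k - omega_k.max())  # subtract the max for numerical stability
    return e / e.sum()                   # [p_{k,1}(x), ..., p_{k,C}(x)]
```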

Definition [15]: A Decision Profile, DP, for a given image x is defined as:

$$\begin{aligned} DP(x) = \left[ \begin{array}{cccc} p_{1,1}(x) & p_{1,2}(x) & \ldots & p_{1,C}(x) \\ \vdots & \vdots & \ddots & \vdots \\ p_{K,1}(x) & p_{K,2}(x) & \ldots & p_{K,C}(x) \end{array} \right] \end{aligned}$$
(1)

Definition [15]: Given N training images, a Decision Template is defined as a set of matrices \(DT=(DT^1,\ldots ,DT^C)\), where the c-th element is obtained as the average of the decision profiles (1) on the training images of class c:

$$ DT^{c} = \frac{\sum _{j =1}^N {DP(x_j)\times Ind(x_j,c)}}{\sum _{j=1}^N Ind(x_j,c)}, $$

where \(Ind(x_j,c)\) is an indicator function with value 1 if the training image \(x_j\) has a crisp label c, and 0, otherwise [16].
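These two definitions translate directly into code; the sketch below assumes the decision profiles of the N training images are stored as a NumPy array (shapes and names are our own, not taken from the paper).

```python
# Sketch: build DP(x) from the K softmax vectors, and DT^c as the class-wise
# average of the decision profiles over the training images of class c
# (Eq. (1) and the definition above). Shapes and names are illustrative.
import numpy as np

def decision_profile(softmax_vectors):
    """softmax_vectors: list of K arrays of shape (C,) -> DP(x), shape (K, C)."""
    return np.stack(softmax_vectors, axis=0)

def decision_templates(profiles, labels, num_classes):
    """profiles: array (N, K, C) with DP(x_j); labels: array (N,) of crisp labels.
    Returns DT with shape (C, K, C), where DT[c] averages the DPs of class c."""
    dt = np.zeros((num_classes,) + profiles.shape[1:])
    for c in range(num_classes):
        dt[c] = profiles[labels == c].mean(axis=0)  # Ind(x_j, c) selects class-c images
    return dt
```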

Finally, the resulting prediction for each image is determined by considering the similarity \(s(DP(x),DT^c)\) between the decision profile DP(x) of the test image and the decision template of each class \(c=1,\ldots ,C\). Regarding the arguments of the similarity function s(.,.) as fuzzy sets on some universal set with \(K \times C\) elements, various fuzzy measures of similarity can be used. We chose different measures [15], namely two measures of similarity, two inclusion indices, a consistency measure and a measure based on the Euclidean distance. These measures are formally defined as:

$$ S_{1}(DT^{c}, DP(x)) = \frac{\sum _{k=1}^K\sum _{i=1}^C \min (DT_{k,i}^c, DP_{k,i}(x))}{\sum _{k=1}^K\sum _{i=1}^C \max (DT_{k,i}^c, DP_{k,i}(x))}, $$
$$ S_{2}(DT^{c}, DP(x)) = 1- \sup _u\{\left| DT_{k,i}^c-DP_{k,i}(x)\right| : i=1,\ldots ,C , k=1,\ldots ,K\}, $$
$$ I_{1}(DT^{c}, DP(x)) = \frac{\sum _{k=1}^K\sum _{i=1}^C \min (DT_{k,i}^c, DP_{k,i}(x))}{\sum _{k=1}^K\sum _{i=1}^C DT_{k,i}^c}, $$
$$ I_{2}(DT^{c}, DP(x)) = \inf _u\{\max (\overline{DT_{k,i}^c},DP_{k,i}(x)) : i=1,\ldots ,C, k=1,\ldots ,K\}, $$
$$ C(DT^{c}, DP(x)) = \sup _u\{\min (DT_{k,i}^c,DP_{k,i}(x)) : i=1,\ldots ,C, k=1,\ldots ,K\}, $$
$$ N(DT^{c}, DP(x)) = 1 - \frac{\sum _{k=1}^K\sum _{i=1}^C (DT_{k,i}^c-DP_{k,i}(x))^2}{K \times C}, $$

where \(DT_{k,i}^c\) is the probability assigned to class i by classifier k in \(DT^c\), \(\overline{DT_{k,i}^c}\) is the complement of \(DT_{k,i}^c\), calculated as \(1-DT_{k,i}^c\), and \(DP_{k,i}(x)\) is the probability assigned by classifier k to class i in the DP calculated for image x. The final label L is the class that maximizes the chosen measure (similarity, inclusion index, consistency measure or Euclidean-based measure) between DP(x) and \(DT^c\): \(L(x) = argmax _{c=1,\ldots ,C}\{s(DT^c, DP(x))\}.\)
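The following sketch implements a subset of these measures (S1, S2, I2 and the Euclidean-based N) and the argmax decision rule; array shapes and function names are our assumptions.

```python
# Sketch of the fusion step: a few of the measures above and the final label
# as the class whose decision template best matches the decision profile.
import numpy as np

def s1(dt_c, dp):        # similarity: sum of minima over sum of maxima
    return np.minimum(dt_c, dp).sum() / np.maximum(dt_c, dp).sum()

def s2(dt_c, dp):        # 1 minus the largest pointwise difference
    return 1.0 - np.abs(dt_c - dp).max()

def i2(dt_c, dp):        # inclusion index using the complement of DT^c
    return np.maximum(1.0 - dt_c, dp).min()

def n_euclid(dt_c, dp):  # Euclidean-based measure N
    k, c = dp.shape
    return 1.0 - ((dt_c - dp) ** 2).sum() / (k * c)

def fuse_label(dp, dt, measure=i2):
    """dp: (K, C) decision profile of a test image; dt: (C, K, C) decision templates."""
    scores = [measure(dt[c], dp) for c in range(dt.shape[0])]
    return int(np.argmax(scores))        # L(x) = argmax_c s(DT^c, DP(x))
```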

3 Experiments

3.1 Datasets

The data used to evaluate our approach come from two public datasets with very different images: Food-11 [17] and Food-101 [4]. They were chosen in order to verify that the classifiers fusion provides good results regardless of the properties of the target dataset, such as intra-class variability (the first is composed of many dishes grouped into general categories, while the second is composed of specific fine-grained dishes), inter-class similarity, number of images, number of classes and image acquisition conditions, among others.

Food-11 is a dataset for food recognition [17], which contains 16,643 images grouped into 11 general categories of food: bread, dairy products, dessert, egg, fried food, meat, noodle/pasta, rice, seafood, soup and vegetable/fruit (see Fig. 3). The images were collected from existing food datasets (Food-101, UECFOOD100, UECFOOD256) and social networks (Flickr, Instagram). The dataset has an unbalanced number of images per class, with an average of 1,513 images per class and a standard deviation of 702. For our experiments, we used the same data split, images and proportions provided by the authors [17]: 60% for training, 20% for validation and 20% for test, that is, 9,866, 3,430 and 3,347 images per set, respectively.

Fig. 3. Images from the Food-11 dataset. Each image corresponds to a different class.

Food-101 is a standard benchmark to evaluate the performance of visual food recognition [4]. This dataset contains 101,000 real-world food images downloaded from foodspotting.com, which were taken under unconstrained conditions. The authors chose the top 101 most popular classes of food (see Fig. 1) and collected 1,000 images for each class: 75% for training and 25% for testing. The classes consist of very diverse and fine-grained dishes from various countries, with high intra-class variation and inter-class similarity in most cases. In our experiments, we used the same data splits provided by the authors. Unlike for Food-11, and keeping the procedure followed by other authors [5, 8], we augmented the training images by means of a series of random distortions, such as adjusting color balance, contrast, brightness and sharpness. Finally, we made random crops of the images, with a dimension of 299\(\,\times \,\)299 for InceptionV3 and of 224\(\,\times \,\)224 for the other models. Then, we applied random horizontal flips with a probability of 50%, and subtracted the average image value of the ImageNet dataset. During validation, we applied a similar preprocessing, with the difference that we made a center crop instead of random crops and did not apply random horizontal flips. During test, we followed the same procedure as in validation (1-Crop evaluation). Furthermore, we also evaluated the CNNs using 10-Crops, namely the upper left, upper right, lower left, lower right and center crops, both in their original setup and applying a horizontal flip [10]. For the 10-Crops evaluation, the classifier produces a tentative label for each crop, and majority voting is then used over all predictions; when two labels are predicted the same number of times, the final label is assigned by comparing their average prediction probabilities, as sketched below.
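The following sketch illustrates the 10-Crops voting rule just described, assuming the per-crop softmax probabilities are available as a single array (the helper name and input format are our assumptions).

```python
# Sketch of the 10-Crops decision: majority voting over the crop predictions,
# with ties broken by the highest average prediction probability.
import numpy as np

def ten_crop_label(crop_probs):
    """crop_probs: array of shape (10, C) with the softmax output of each crop."""
    votes = crop_probs.argmax(axis=1)                        # tentative label per crop
    counts = np.bincount(votes, minlength=crop_probs.shape[1])
    tied = np.flatnonzero(counts == counts.max())            # labels with the most votes
    if len(tied) == 1:
        return int(tied[0])
    mean_probs = crop_probs.mean(axis=0)                     # tie-break on average probability
    return int(tied[np.argmax(mean_probs[tied])])
```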

We used four metrics to evaluate the performance of our approach: overall Accuracy (ACC), Precision (P), Recall (R) and \(F_{1}\) score.
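These metrics could be computed as in the sketch below, assuming scikit-learn and macro-averaging over classes for P, R and F1 (the averaging scheme is our assumption; the paper does not state it).

```python
# Sketch: the four reported metrics, with macro-averaging assumed for the
# per-class metrics (not specified in the paper).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "P": precision_score(y_true, y_pred, average="macro"),
        "R": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
    }
```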

3.4 Experimental Results on Food-11

The results obtained on the Food-11 dataset are shown in Table 1, which gives the error rate (1 - accuracy) for the best CNN models compared to the CNNs Fusion. We report the overall accuracy obtained by processing the test data with two procedures: (1) a center crop (1-Crop), and (2) 10 different crops of the image (10-Crops). The experimental results show an error rate of less than 10% for all classifiers, with slightly better performance when using 10-Crops. The best accuracy is achieved by our CNNs Fusion approach, which is about 0.75% better than the best result of the classifiers evaluated separately. On the other hand, the baseline classification on Food-11 was given by its authors, who obtained an overall accuracy of 83.5% using GoogLeNet models fine-tuned on the last six layers without any pre-processing or post-processing steps. Note that the best results obtained with our approach used the pointwise measures (S2, I2). The particularity of these measures is that they penalize large differences between corresponding values of the DT and the DP, both for the specific class to be assigned and for the rest of the classes. From now on, in this section we only report results based on the 10-Crops procedure.

Table 1. Overall test set error rate on Food-11 obtained for each model. The measure used is shown in parentheses for the CNNs Fusion models.

As shown in Table 2, the CNNs Fusion is able to properly classify not only the images that were correctly classified by both individual classifiers, but in some occasions also images for which one or both of them fail. This suggests that in some cases both classifiers may be close to predicting the correct class, and combining their outputs can lead to a better decision.

Table 2. Percentage of images well-classified and misclassified on Food-11 using our CNNs Fusion approach, distributed by the results obtained with GoogLeNet (CNN\(_{1}\)) and ResNet50 (CNN\(_{2}\)) models independently evaluated.

Samples misclassified by our model are shown in Fig. 4; most of these errors are produced by mixed items, high inter-class similarity or wrongly labeled images. For each sample image, we show the predicted class (top) and the ground truth (bottom).

Fig. 4. Misclassified Food-11 examples: predicted labels (top) and ground truth (bottom).

In Table 3, we show the precision, recall and \(F_1\) score obtained for each class separately. Comparing the \(F_1\) scores, the best performance is achieved for the class Noodles_Pasta and the worst for Dairy products. Specifically, the class Noodles_Pasta has only one misclassified image, which is moreover a hard sample, because it contains two classes together (see the mixed items in Fig. 4). Considering the precision, the worst results are obtained for the class Bread, which is understandable considering that bread can sometimes be present in images of other classes (e.g. soup or egg). In the case of recall, the worst results are obtained for Dairy products, where an error greater than 8% is produced by misclassifying several images as the class Dessert. This is mainly because images of the class Dessert contain many items that could also belong to the class Dairy products (e.g. frozen yogurt or ice cream) or that are visually similar to them.

Table 3. Per-class precision, recall and \(F_1\) score obtained on Food-11 using our CNNs Fusion approach.

3.5 Experimental Results on Food-101

The overall accuracy on the Food-101 dataset is shown in Table 4 for the two classifiers based on CNN models and for our CNNs Fusion. The overall accuracy is obtained by evaluating the predictions using 1-Crop and 10-Crops. The experimental results show better performance (about 1% more) using 10-Crops instead of 1-Crop. From now on, in this section we only report results based on the 10-Crops procedure. As observed on Food-11, the best accuracy obtained with our approach was achieved with the pointwise measures S2 and I2, where the latter provides slightly better performance. Again, the best accuracy is achieved by the CNNs Fusion, which is about 1.5% higher than the best result of the classifiers evaluated separately. Note that the best performance on Food-101 (overall accuracy of 90.27%) was obtained using WISeR [7]. In addition, the authors report the performance of other deep learning-based approaches, among which three CNN models achieved over 88% (InceptionV3, ResNet200 and WRN [19]). However, the WISeR, WRN and ResNet200 models were not considered in our experiments, since they need a multi-GPU server to replicate their results. In addition, those models have 2.5 times more parameters than the models chosen, which involves a high computational cost, especially during the learning stage. Following the steps described in the corresponding articles, our best replications were those based on InceptionV3 and ResNet50, which we therefore used as the base models to evaluate the performance of our CNNs Fusion approach.

Table 4. Overall test set accuracy of Food-101 obtained for each model.

As shown in Table 5, on this dataset the CNNs Fusion is also able to properly classify not only the images that were correctly classified by both classifiers, but also images for which one or both of them fail. Therefore, we demonstrate that our proposed approach maintains its behavior independently of the target dataset.

Table 5. Percentage of images well-classified and misclassified on Food-101 using our CNNs Fusion approach, distributed by the results obtained with InceptionV3 (CNN\(_{1}\)) and ResNet50 (CNN\(_{2}\)) models independently evaluated.

Table 6 shows the top worst and best classification results on the Food-101 classes. We highlight the classes with the worst and best results. For the worst class (Steak), the precision and recall achieved are 60.32% and 59.60%, respectively. Interestingly, about 26% of the precision error and 30% of the recall error are produced by only three classes: Filet mignon, Pork chop and Prime rib. As shown in Fig. 5, these are fine-grained classes with high inter-class similarity, which makes them highly difficult for the classifier, because it has to identify the small details that determine the class of the images. On the other hand, the best class (Edamame) was classified with 99.60% precision and 100% recall. Unlike Steak, Edamame is a simple class to classify, because it has low intra-class variation and low inter-class similarity. In other words, the images in this class have a similar visual appearance and are quite different from the images of the other classes. Regarding the single misclassified image, its visual appearance is close to that of the class Edamame in terms of shape and color.

Table 6. Top 3 best and worst classification results on Food-101.
Fig. 5. Misclassified examples for the Food-101 classes with the worst (Steak) and best (Edamame) classification results by \(F_1\) score (ground truth label at the bottom).

4 Conclusions

In this paper, we addressed the problem of food recognition and proposed a CNNs Fusion approach based on the concepts of decision templates and decision profiles and their similarity, which improves classification performance with respect to using the CNN models separately. Evaluating different similarity measures, we show that the optimal one is based on the infimum of the maximum between the complement of the decision template and the decision profile of the test image. On Food-11, our approach outperforms the baseline accuracy by more than 10%. As for Food-101, we used two CNN architectures that provide among the best state-of-the-art results, and our CNNs Fusion strategy outperformed them again. As future work, we plan to evaluate the performance of the CNNs Fusion strategy as a function of the number of CNN models.