
Recognize after early fusion: the Chinese food recognition based on the alignment of image and ingredients

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

As health concerns grow, food computing is attracting increasing research attention. A basic task in the field is extracting and analyzing important information about food from a picture. Food recognition, however, poses two challenges. First, the type of a dish is closely tied to its ingredients. Second, in Chinese dietary habits a single meal typically includes multiple dishes, yet existing food image datasets contain only single-food pictures. To address these challenges, we propose Recognize After Early Fusion (RAEF), a Chinese food recognition model based on the alignment of image and ingredients. RAEF uses a Vision Transformer (ViT) as its backbone and an early fusion model to combine visual and ingredient features. Because no suitable dataset exists for multi-label food recognition, we also propose a new Chinese food dataset named Chinsefood-130. The dataset is available at https://pan.baidu.com/s/1gpjAY3JBX_wGNuCLhxLmQQ (password: mr2b). In our experiments, RAEF performs well on both food and ingredient recognition: compared with ViT, it improves the F1 score by 10% on food recognition and by 12% on ingredient recognition.
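The abstract reports results as F1 scores on multi-label food and ingredient recognition. As a rough illustration of how such a score can be computed, the sketch below implements a micro-averaged F1 over per-image label sets. The averaging scheme is an assumption: the abstract does not say whether the reported figures are micro- or macro-averaged, nor whether the 10% and 12% gains are absolute or relative. The function name `micro_f1` and the dish labels are hypothetical and are not taken from the dataset.

```python
from typing import Sequence, Set


def micro_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions (one label set per image)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # labels predicted and actually present
        fp += len(pred - truth)   # labels predicted but not present
        fn += len(truth - pred)   # labels present but missed
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there are no labels at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: two meal images, each with a set of dish labels.
truth = [{"mapo tofu", "steamed rice"}, {"kung pao chicken"}]
pred = [{"mapo tofu"}, {"kung pao chicken", "steamed rice"}]
score = micro_f1(truth, pred)
```

The same function applies to ingredient recognition if the sets hold ingredient names instead of dish names. A macro-averaged variant would compute F1 per label and then average, which weights rare dishes more heavily.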


Data availability

The dataset we propose is public. The URL is given in the article.


Funding

This work was supported by the National Key Research and Development Program of China (2022YFF06069003).

Author information

Corresponding author

Correspondence to Hongtao Bai.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, R., Ouyang, D., He, L. et al. Recognize after early fusion: the Chinese food recognition based on the alignment of image and ingredients. Multimedia Systems 30, 93 (2024). https://doi.org/10.1007/s00530-024-01297-w


  • DOI: https://doi.org/10.1007/s00530-024-01297-w
