
5 months ago

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Shukor Mustafa; Thome Nicolas; Cord Matthieu


Abstract

Vision-Language Pretraining (VLP) and foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with more structured input data, such as cooking applications, is still little investigated. In this work, we propose to leverage these techniques for structured-text-based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then to finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g., CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of cross-modal food retrieval on the large Recipe1M dataset. We conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. Finally, we validate the generalization of the approach to other tasks (i.e., food recognition) and to other domains with structured text, such as the medical domain on the ROCO dataset. The code is available here: https://github.com/mshukor/VLPCook
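To make "VLP objectives" concrete, the sketch below implements a symmetric image-text contrastive (InfoNCE-style) loss, a standard VLP objective popularized by CLIP. It is a generic illustration, not VLPCook's actual objectives: the abstract says those are adapted to structured text, and their exact form is not given here. The function names, the `temperature` value of 0.07, and the use of plain Python lists as embeddings are all assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_emb: List[Vector], text_emb: List[Vector],
                     temperature: float = 0.07) -> float:
    """Hypothetical symmetric InfoNCE loss: pair i (image i, text i) is the positive,
    every other text (resp. image) in the batch is a negative."""
    logits = [[cosine(img, txt) / temperature for txt in text_emb] for img in image_emb]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = contrastive_loss(images, aligned_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
```

Correctly paired embeddings give a loss close to zero, while mismatched pairs are heavily penalized. Minimizing this quantity is what pulls matching image and text embeddings together in a shared space.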

Code Repositories

mshukor/vlpcook (official, PyTorch)

Benchmarks

Benchmark | Method | Image-to-text R@1 | Text-to-image R@1
cross-modal-retrieval-on-recipe1m | VLPCook | 73.6 | 74.7
cross-modal-retrieval-on-recipe1m | VLPCook (R1M+) | 74.9 | 75.6
cross-modal-retrieval-on-recipe1m-1 | VLPCook | 45.2 | 47.3
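The R@1 scores above are the percentage of queries whose correct match is ranked first. The sketch below computes Recall@K in both directions from an image-by-recipe similarity matrix. It only illustrates the metric. It assumes rows are image queries, columns are recipe texts, and the correct match for row i is column i. It does not reproduce the Recipe1M evaluation protocol (test-subset size, averaging over subsets), which is not described on this page. The similarity values are made up for the example.

```python
from typing import List

Matrix = List[List[float]]


def recall_at_k(sim: Matrix, k: int = 1) -> float:
    """Percentage of queries (rows) whose matching item (same index) is in the top-k columns."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate columns by descending similarity (stable sort keeps index order on ties).
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(m: Matrix) -> Matrix:
    """Swap query and candidate roles: text-to-image uses columns as queries."""
    return [list(col) for col in zip(*m)]


# Hypothetical similarities: sim[i][j] = score(image i, recipe j).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.9],
]
i2t_r1 = recall_at_k(sim, k=1)             # image 1 retrieves recipe 2 first: 2 of 3 hits
t2i_r1 = recall_at_k(transpose(sim), k=1)  # every recipe retrieves its own image first
```

Because the same matrix is read row-wise for image-to-text and column-wise for text-to-image, the two directions can differ, which is why the benchmark reports them separately.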
