5 months ago

ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition

Roy Debaditya ; Verma Dhruv ; Fernando Basura

Abstract

Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations as well as the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence a situation recognition model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. Therefore, we leverage the CLIP foundational model that has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task by using CLIP image and text embedding features and it even outperforms the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer using CLIP visual tokens that model the relation between textual roles and visual entities. Our cross-attention-based Transformer known as ClipSitu XTF outperforms existing state-of-the-art by a large margin of 14.1% on semantic role labelling (value) for top-1 accuracy using imSitu dataset. Similarly, our ClipSitu XTF obtains state-of-the-art situation localization performance. We will make the code publicly available.
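The cross-attention idea in the abstract, where textual roles attend over visual tokens, can be illustrated with a minimal sketch. This is not the authors' ClipSitu XTF: it is a single-head scaled dot-product attention with no learned projections, random vectors stand in for CLIP text and patch embeddings, and the role names, dimensions, and function names are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(role_queries, visual_tokens):
    """Each role query attends over all visual tokens.

    Single-head scaled dot-product attention without learned projections
    (an illustrative simplification, not the paper's architecture).
    Returns one pooled visual feature and one attention row per role.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in role_queries:
        # Similarity between this role and every visual token.
        scores = [
            scale * sum(q * k for q, k in zip(query, token))
            for token in visual_tokens
        ]
        attn = softmax(scores)
        # Attention-weighted average of the visual tokens.
        pooled = [
            sum(a * token[j] for a, token in zip(attn, visual_tokens))
            for j in range(dim)
        ]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


# Hypothetical stand-ins: random vectors instead of real CLIP embeddings.
random.seed(0)
DIM, NUM_PATCHES = 8, 5
ROLES = ["agent", "tool", "place"]
visual_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
role_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in ROLES]

role_features, attn_weights = cross_attention(role_queries, visual_tokens)
```

Each row of `attn_weights` shows how strongly one role attends to each visual token. A real model would add learned query, key, and value projections and a classifier on top of `role_features` to predict the value for each role.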

Code Repositories

LUNAProject22/CLIPSitu (Official, PyTorch; mentioned in GitHub)

Benchmarks

Benchmark: grounded-situation-recognition-on-swig
Methodology: ClipSitu
Metrics:
  Top-1 Verb: 58.19
  Top-1 Verb & Grounded-Value: 40.01
  Top-1 Verb & Value: 47.23
  Top-5 Verbs: 85.69
  Top-5 Verbs & Grounded-Value: 49.78
  Top-5 Verbs & Value: 68.42

Benchmark: situation-recognition-on-imsitu
Methodology: ClipSitu
Metrics:
  Top-1 Verb: 47.23
  Top-1 Verb & Value: 29.73
  Top-5 Verbs: 85.69
  Top-5 Verbs & Value: 68.42
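
As a reading aid for the verb metrics above, here is a minimal sketch of top-k accuracy, assuming "Top-1 Verb" and "Top-5 Verbs" are standard top-k accuracies reported in percent. The function name and the toy predictions are hypothetical and are not taken from the benchmark data.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Percentage of samples whose gold label is among the top-k predictions."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_labels)


# Toy data: each inner list is a model's verb ranking, best guess first.
ranked = [
    ["riding", "jumping", "racing"],
    ["cooking", "eating", "stirring"],
    ["running", "walking", "jogging"],
    ["painting", "drawing", "writing"],
]
gold = ["riding", "stirring", "jogging", "sculpting"]

top1 = top_k_accuracy(ranked, gold, k=1)  # 25.0
top3 = top_k_accuracy(ranked, gold, k=3)  # 75.0
```

Only the first sample is correct at rank 1, while three of the four gold verbs appear within the top three guesses. The combined "Verb & Value" metrics additionally depend on the predicted role values and are not covered by this sketch.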
