ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition
Roy Debaditya ; Verma Dhruv ; Fernando Basura

Abstract
Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence, a situation recognition model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. Therefore, we leverage the CLIP foundational model, which has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task using CLIP image and text embedding features, and even outperform the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer using CLIP visual tokens that models the relation between textual roles and visual entities. Our cross-attention-based Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1% on semantic role labelling (value) for top-1 accuracy on the imSitu dataset. Similarly, ClipSitu XTF obtains state-of-the-art situation localization performance. We will make the code publicly available.
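To make the cross-attention idea in the abstract concrete, here is a minimal sketch of role queries attending over visual tokens. It is not the ClipSitu XTF implementation: it uses a single head, no learned projections, and hand-written toy vectors instead of CLIP embeddings. The names `cross_attention`, `role_queries`, and `visual_tokens` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(role_queries, visual_tokens):
    """Single-head scaled dot-product cross-attention.

    Each role query (standing in for a text-role embedding) attends over
    all visual tokens (standing in for image patch embeddings). Keys and
    values are the visual tokens themselves; a real model would apply
    learned projections, which this sketch omits.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in role_queries:
        # Similarity between this role and every visual token.
        scores = [
            scale * sum(q * k for q, k in zip(query, token))
            for token in visual_tokens
        ]
        attn = softmax(scores)
        # Attention-weighted average of the visual tokens.
        pooled = [
            sum(a * token[d] for a, token in zip(attn, visual_tokens))
            for d in range(dim)
        ]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


# Toy data (assumed values, not CLIP features): two roles, three visual tokens.
role_queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_tokens = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]
outputs, weights = cross_attention(role_queries, visual_tokens)
```

In this toy setup each role places most of its attention on the visual token it aligns with, which is the behaviour the abstract describes as modelling the relation between textual roles and visual entities.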
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| grounded-situation-recognition-on-swig | ClipSitu | Top-1 Verb: 58.19; Top-1 Verb & Grounded-Value: 40.01; Top-1 Verb & Value: 47.23; Top-5 Verbs: 85.69; Top-5 Verbs & Grounded-Value: 49.78; Top-5 Verbs & Value: 68.42 |
| situation-recognition-on-imsitu | ClipSitu | Top-1 Verb: 47.23; Top-1 Verb & Value: 29.73; Top-5 Verbs: 85.69; Top-5 Verbs & Value: 68.42 |