Wei Meng ; Chen Long ; Ji Wei ; Yue Xiaoyu ; Chua Tat-Seng

Abstract
Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient in verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a novel SituFormer for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and then a fine-grained model trained with triplet loss re-ranks these candidates with enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model, which detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependency of the roles. Extensive validations on the challenging SWiG benchmark show that SituFormer achieves a new state-of-the-art performance with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.
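The two-step verb prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `triplet_loss`, `coarse_to_fine`, and `verb_prototypes`, the toy scores and 2-D features, and the use of nearest-prototype Euclidean distance as the re-ranking rule are all assumptions made for illustration. The abstract only states that a coarse model proposes verb candidates and a triplet-trained fine model re-ranks them.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances.

    It is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (the margin value is an assumption).
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def coarse_to_fine(coarse_scores, image_feature, verb_prototypes, k=3):
    """Propose the top-k verbs by coarse score, then re-rank them.

    Re-ranking by distance to a per-verb prototype is a stand-in for the
    fine-grained model; the paper's actual re-ranking may differ.
    """
    # Step 1 (coarse): keep the k verbs with the highest classifier scores.
    candidates = sorted(coarse_scores, key=coarse_scores.get, reverse=True)[:k]
    # Step 2 (fine): order candidates by embedding distance, closest first.
    return sorted(
        candidates,
        key=lambda verb: math.dist(image_feature, verb_prototypes[verb]),
    )


# Toy data (hypothetical): the coarse model slightly prefers "selling".
coarse_scores = {"buying": 0.30, "selling": 0.35, "paying": 0.20, "running": 0.15}
verb_prototypes = {
    "buying": [1.0, 0.0],
    "selling": [0.0, 1.0],
    "paying": [0.7, 0.7],
    "running": [-1.0, -1.0],
}
image_feature = [0.9, 0.1]

ranked = coarse_to_fine(coarse_scores, image_feature, verb_prototypes, k=3)
print(ranked)
```

In this toy setup the coarse scores alone would pick "selling", while the distance-based re-ranking moves "buying" to the top, which is the kind of correction between visually similar verbs that the coarse-to-fine design is meant to enable.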
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| grounded-situation-recognition-on-swig | SituFormer | Top-1 Verb: 44.2; Top-1 Verb & Grounded-Value: 29.22; Top-1 Verb & Value: 35.24; Top-5 Verbs: 71.21; Top-5 Verbs & Grounded-Value: 46; Top-5 Verbs & Value: 55.75 |
| situation-recognition-on-imsitu | SituFormer | Top-1 Verb: 44.2; Top-1 Verb & Value: 35.24; Top-5 Verbs: 71.21; Top-5 Verbs & Value: 55.75 |