Command Palette
Search for a command to run...

Abstract
We address the problem of recognizing situations in images. Given an image,the task is to predict the most salient verb (action), and fill its semanticroles such as who is performing the action, what is the source and target ofthe action, etc. Different verbs have different roles (e.g. attacking hasweapon), and each role can take on many possible values (nouns). We propose amodel based on Graph Neural Networks that allows us to efficiently capturejoint dependencies between roles using neural networks defined on a graph.Experiments with different graph connectivities show that our approach thatpropagates information between roles significantly outperforms existing work,as well as multiple baselines. We obtain roughly 3-5% improvement over previouswork in predicting the full situation. We also provide a thorough qualitativeanalysis of our model and influence of different roles in the verbs.
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| grounded-situation-recognition-on-swig | GraphNet | Top-1 Verb: 36.72 Top-1 Verb u0026 Value: 27.52 Top-5 Verbs: 61.90 Top-5 Verbs u0026 Value: 45.39 |
| situation-recognition-on-imsitu | GraphNet | Top-1 Verb: 36.72 Top-1 Verb u0026 Value: 27.52 Top-5 Verbs: 61.90 Top-5 Verbs u0026 Value: 45.39 |
Build AI with AI
From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.