Arun Mallya, Svetlana Lazebnik

Abstract
This work proposes Recurrent Neural Network (RNN) models to predict structured 'image situations' -- actions and noun entities fulfilling semantic roles related to the action. In contrast to prior work relying on Conditional Random Fields (CRFs), we use a specialized action prediction network followed by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on the challenging recent imSitu dataset, beating CRF-based models, including ones trained with additional data. Further, we show that specialized features learned from situation prediction can be transferred to the task of image captioning to more accurately describe human-object interactions.
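The two-stage control flow described in the abstract (predict the action first, then emit one noun per semantic role with a recurrent model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `FRAMES` table, the `toy_step` function, and all scores are hypothetical stand-ins, and a plain lookup function takes the place of a trained RNN cell.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical verb -> ordered semantic roles table (assumed for illustration).
FRAMES: Dict[str, List[str]] = {
    "riding": ["agent", "vehicle", "place"],
    "eating": ["agent", "food", "place"],
}

State = Tuple[str, ...]  # nouns emitted so far, standing in for an RNN hidden state
StepFn = Callable[[State, str, str], Dict[str, float]]


def predict_verb(verb_scores: Dict[str, float]) -> str:
    """Stage 1: pick the highest-scoring action from the action network's scores."""
    return max(verb_scores, key=lambda v: verb_scores[v])


def predict_nouns(verb: str, step: StepFn) -> Dict[str, str]:
    """Stage 2: emit one noun per role, conditioned on the verb and earlier nouns."""
    state: State = ()
    nouns: Dict[str, str] = {}
    for role in FRAMES[verb]:
        noun_scores = step(state, verb, role)
        noun = max(noun_scores, key=lambda n: noun_scores[n])
        nouns[role] = noun
        state = state + (noun,)  # later roles can see earlier predictions
    return nouns


def toy_step(state: State, verb: str, role: str) -> Dict[str, float]:
    """Toy stand-in for a recurrent cell: hard-coded scores, not a trained model."""
    if role == "agent":
        return {"man": 0.7, "dog": 0.3}
    if role == "vehicle":
        return {"horse": 0.6, "bike": 0.4}
    # The place score depends on the earlier 'vehicle' prediction held in state.
    if "horse" in state:
        return {"field": 0.8, "street": 0.2}
    return {"field": 0.2, "street": 0.8}


situation_verb = predict_verb({"riding": 0.9, "eating": 0.1})
situation = predict_nouns(situation_verb, toy_step)
print(situation_verb, situation)
```

The point of threading `state` through the loop is that each noun decision can depend on the nouns chosen for earlier roles, which is the property a recurrent decoder provides over predicting each role independently.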
Benchmarks
| Benchmark | Methodology | Top-1 Verb | Top-1 Verb & Value | Top-5 Verbs | Top-5 Verbs & Value |
|---|---|---|---|---|---|
| grounded-situation-recognition-on-swig | RNN + Fusion | 35.9 | 27.45 | 63.08 | 46.88 |
| situation-recognition-on-imsitu | RNN + Fusion | 35.9 | 27.45 | 63.08 | 46.88 |
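For readers unfamiliar with the metrics reported above, the sketch below shows one plausible way to compute top-k verb accuracy and a per-role "value" score. It rests on assumptions not stated on this page: that a predicted noun counts as correct if it matches any annotator's label for that role, and that an image contributes zero to the value score when the gold verb is not among the top k predictions. The function names and the example data are hypothetical.

```python
from typing import Dict, List


def verb_in_top_k(ranked_verbs: List[str], gold_verb: str, k: int) -> bool:
    """True if the gold verb appears among the k highest-ranked predictions."""
    return gold_verb in ranked_verbs[:k]


def value_score(pred_nouns: Dict[str, str], gold_frames: List[Dict[str, str]]) -> float:
    """Fraction of roles whose predicted noun matches at least one annotator."""
    roles = list(gold_frames[0].keys())
    hits = sum(
        any(pred_nouns.get(role) == frame[role] for frame in gold_frames)
        for role in roles
    )
    return hits / len(roles)


def evaluate(examples: List[dict], k: int) -> Dict[str, float]:
    """Return percentage scores; value is counted only when the verb is a hit."""
    verb_hits = 0
    value_total = 0.0
    for ex in examples:
        if verb_in_top_k(ex["ranked_verbs"], ex["gold_verb"], k):
            verb_hits += 1
            value_total += value_score(ex["pred_nouns"], ex["gold_frames"])
    n = len(examples)
    return {"verb": 100 * verb_hits / n, "value": 100 * value_total / n}


# Two made-up images; pred_nouns are the nouns predicted for the gold verb's roles.
examples = [
    {
        "ranked_verbs": ["riding", "racing"],
        "gold_verb": "riding",
        "pred_nouns": {"agent": "man", "vehicle": "bike"},
        "gold_frames": [
            {"agent": "man", "vehicle": "horse"},
            {"agent": "person", "vehicle": "pony"},
        ],
    },
    {
        "ranked_verbs": ["eating", "cooking"],
        "gold_verb": "cooking",
        "pred_nouns": {"agent": "woman", "food": "soup"},
        "gold_frames": [{"agent": "woman", "food": "soup"}],
    },
]

top1 = evaluate(examples, k=1)
top2 = evaluate(examples, k=2)
print(top1, top2)
```

Under these assumptions the value score can never exceed the verb score at the same k, which is consistent with the ordering of the numbers in the table.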