
Abstract

This work proposes Recurrent Neural Network (RNN) models to predict structured 'image situations' -- actions and noun entities fulfilling semantic roles related to the action. In contrast to prior work relying on Conditional Random Fields (CRFs), we use a specialized action prediction network followed by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on the challenging recent imSitu dataset, beating CRF-based models, including ones trained with additional data. Further, we show that specialized features learned from situation prediction can be transferred to the task of image captioning to more accurately describe human-object interactions.
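
The two-stage structure described above (an action classifier followed by a recurrent noun predictor that fills one semantic role per step) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the system evaluated in this work: the toy verb, role, and noun vocabularies, the untrained random weights, the plain tanh recurrent cell, the choice to feed the previously predicted noun back as input, and all names such as `SituationPredictor`.

```python
import math
import random

# Hypothetical toy label sets; the real imSitu vocabularies are far larger.
VERB_ROLES = {
    "cooking": ["agent", "food", "tool"],
    "riding": ["agent", "vehicle", "place"],
}
VERBS = sorted(VERB_ROLES)
ROLES = sorted({role for roles in VERB_ROLES.values() for role in roles})
NOUNS = ["person", "horse", "bicycle", "field", "pasta", "pan", "kitchen"]

FEAT, HID = 8, 6  # assumed image-feature and hidden-state sizes


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


class SituationPredictor:
    """Stage 1 picks a verb; stage 2 predicts one noun per role, in order."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_verb = rand_matrix(len(VERBS), FEAT, rng)
        in_dim = FEAT + len(VERBS) + len(ROLES) + len(NOUNS)
        self.w_in = rand_matrix(HID, in_dim, rng)
        self.w_hid = rand_matrix(HID, HID, rng)
        self.w_out = rand_matrix(len(NOUNS), HID, rng)

    def predict(self, image_features):
        feats = list(image_features)

        # Stage 1: action prediction from image features alone.
        verb_idx = argmax(matvec(self.w_verb, feats))
        verb = VERBS[verb_idx]

        # Stage 2: a plain tanh RNN steps through the roles of the chosen verb.
        hidden = [0.0] * HID
        prev_noun = [0.0] * len(NOUNS)  # no noun predicted yet
        frame = {}
        for role in VERB_ROLES[verb]:
            step_input = (
                feats
                + one_hot(verb_idx, len(VERBS))
                + one_hot(ROLES.index(role), len(ROLES))
                + prev_noun
            )
            pre = [
                a + b
                for a, b in zip(matvec(self.w_in, step_input),
                                matvec(self.w_hid, hidden))
            ]
            hidden = [math.tanh(p) for p in pre]
            noun_idx = argmax(matvec(self.w_out, hidden))
            frame[role] = NOUNS[noun_idx]
            prev_noun = one_hot(noun_idx, len(NOUNS))
        return verb, frame


if __name__ == "__main__":
    predictor = SituationPredictor(seed=0)
    fake_features = [0.1 * i for i in range(FEAT)]  # stand-in for CNN features
    print(predictor.predict(fake_features))
```

Because the weights are random, the predicted verb and nouns are meaningless; the sketch only shows the control flow, in which the verb is fixed first and each noun prediction is conditioned on the verb, the current role, and the recurrent state carried over from earlier roles.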
