Abstract

Most object recognition approaches focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotation and is therefore labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) by incorporating self-supervision into the traditional framework. We show that the recognition backbone can be substantially enhanced for more robust representation learning, at no extra cost in annotation or inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among instances of the same category. We then design a spatial context learning module for modeling the internal structure of the object by predicting relative positions within the extent. These two modules can be easily plugged into any backbone network during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: https://github.com/JDAI-CV/LIO.
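To illustrate why predicting relative positions within the object extent needs no extra annotation, the following is a minimal sketch, not the formulation used in the paper. It assumes the extent is given as a grid of scores in [0, 1], picks the highest-scoring cell as an anchor, and derives normalized offsets from pure geometry. The function name `relative_position_targets`, the 0.5 threshold, and the anchor choice are all hypothetical and introduced here only for illustration.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def relative_position_targets(
    mask: List[List[float]], threshold: float = 0.5
) -> Tuple[Cell, Dict[Cell, Tuple[float, float]]]:
    """Build self-supervised regression targets from an extent mask.

    Assumption for this sketch: the anchor is the cell with the highest
    extent score, and a cell belongs to the object if its score is at
    least `threshold`.
    """
    n_rows, n_cols = len(mask), len(mask[0])

    # Anchor cell: the location with the strongest extent response.
    anchor = max(
        ((r, c) for r in range(n_rows) for c in range(n_cols)),
        key=lambda rc: mask[rc[0]][rc[1]],
    )

    # Targets: offset of each in-extent cell from the anchor,
    # normalized by the grid size so values are resolution independent.
    targets: Dict[Cell, Tuple[float, float]] = {}
    for r in range(n_rows):
        for c in range(n_cols):
            if mask[r][c] >= threshold:
                targets[(r, c)] = (
                    (r - anchor[0]) / n_rows,
                    (c - anchor[1]) / n_cols,
                )
    return anchor, targets


# Toy 3x3 extent mask (made-up values, for illustration only).
toy_mask = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.6],
    [0.1, 0.7, 0.3],
]
anchor, targets = relative_position_targets(toy_mask)
```

In a training setup of this kind, an auxiliary head would regress such targets from backbone features; because the targets are computed from the grid itself, the head can be dropped at inference without affecting the classifier.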
