Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
Abstract
Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Furthermore, we utilize one-step-ahead attention, an improved approach to capture the informativeness of tokens. By training the model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet-1K). Experimental results show that MCTF consistently surpasses previous reduction methods, both with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base model (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
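To make the fusion idea concrete, below is a minimal, unbatched sketch of one greedy fusion step that combines the three criteria named in the abstract: similarity, informativeness, and size of fused tokens. This is an illustrative assumption, not the authors' implementation; the function name `fuse_one_pair`, the particular scoring formula, and the size-weighted merge are all hypothetical, and the `info` score here is a placeholder for what MCTF estimates with one-step-ahead attention. See the repository linked above for the actual MCTF code.

```python
import torch
import torch.nn.functional as F

def fuse_one_pair(tokens, info, sizes):
    """One greedy multi-criteria fusion step (illustrative sketch).

    tokens: (N, C) token features
    info:   (N,)   informativeness per token; in MCTF this would come
                   from one-step-ahead attention, here it is a placeholder
    sizes:  (N,)   number of original patches each token represents
    Returns (tokens, info, sizes) with N-1 tokens.
    """
    # Criterion 1: pairwise cosine similarity between tokens.
    feats = F.normalize(tokens, dim=-1)
    sim = feats @ feats.T                                     # (N, N)
    # Criteria 2 and 3: prefer fusing uninformative and small tokens,
    # so informative tokens and large fused clusters are preserved.
    score = (sim
             - (info[:, None] + info[None, :])
             - torch.log(sizes[:, None] + sizes[None, :]))
    score.fill_diagonal_(float("-inf"))                       # no self-fusion
    flat = int(score.argmax())
    i, j = divmod(flat, score.size(1))                        # best pair
    # Size-weighted average merges token j into token i.
    w_i, w_j = sizes[i], sizes[j]
    merged = (w_i * tokens[i] + w_j * tokens[j]) / (w_i + w_j)
    keep = torch.arange(tokens.size(0)) != j
    tokens, info, sizes = tokens[keep], info[keep], sizes[keep]
    if i > j:                                                 # index shifts after removal
        i -= 1
    tokens[i] = merged
    sizes[i] = w_i + w_j
    return tokens, info, sizes

# Usage: fuse a few tokens away at one layer (shapes follow DeiT-S).
x = torch.randn(197, 384)   # token features
a = torch.rand(197)         # placeholder informativeness scores
s = torch.ones(197)         # each token starts as a single patch
for _ in range(8):
    x, a, s = fuse_one_pair(x, a, s)
```

A practical implementation would fuse many token pairs per layer in parallel and in batches rather than one pair at a time; the loop above trades that efficiency for clarity.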