Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
Lee Sanghyeok ; Choi Joonmyung ; Kim Hyunwoo J.

Abstract
Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing the redundant tokens. However, these works faced the speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses the previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving the performance (+0.5% and +0.3%) over the base model, respectively. We also demonstrate the applicability of MCTF in various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
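To make the idea of fusing on several criteria at once more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository linked above for that). Everything specific in it is an assumption made for illustration: the function name `multi_criteria_fuse`, the choice to multiply the three criteria into one attraction score, the reciprocal forms used for informativeness and size, and the simple one-directional bipartite matching over alternating tokens. The `info` argument stands in for whatever informativeness score is available; the abstract obtains it from one-step-ahead attention, which this sketch does not compute.

```python
import numpy as np


def multi_criteria_fuse(tokens, info, sizes, r):
    """Fuse r tokens into their best partners (illustrative sketch only).

    tokens: (N, D) float array of token embeddings.
    info:   (N,) positive informativeness scores (assumed given).
    sizes:  (N,) number of original tokens each token already represents.
    r:      number of tokens to remove in this step.
    """
    n = tokens.shape[0]

    # Criterion 1 (similarity): cosine similarity mapped to [0, 1].
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    w_sim = 0.5 * (normed @ normed.T + 1.0)

    # Criterion 2 (informativeness): assumed form that favours fusing
    # pairs in which both tokens carry little information.
    w_info = 1.0 / np.outer(info, info)

    # Criterion 3 (size): assumed form that discourages growing
    # tokens that are already the result of many fusions.
    w_size = 1.0 / np.outer(sizes, sizes)

    # Assumption: combine the criteria by a plain product.
    attraction = w_sim * w_info * w_size

    # Simplified bipartite matching: even-indexed tokens are candidates
    # for removal, odd-indexed tokens are fusion targets.
    src = np.arange(0, n, 2)
    dst = np.arange(1, n, 2)
    scores = attraction[np.ix_(src, dst)]
    best_dst = scores.argmax(axis=1)
    best_val = scores.max(axis=1)

    # Fuse the r source tokens with the strongest attraction.
    r = min(r, len(src))
    fuse_src = np.argsort(-best_val)[:r]

    # Size-weighted sums, so that fused tokens are weighted averages.
    weighted = tokens * sizes[:, None]
    new_sizes = sizes.astype(float)
    for s in fuse_src:
        i, j = src[s], dst[best_dst[s]]
        weighted[j] += weighted[i]
        new_sizes[j] += sizes[i]

    keep = np.ones(n, dtype=bool)
    keep[src[fuse_src]] = False
    fused = weighted[keep] / new_sizes[keep, None]
    return fused, new_sizes[keep]
```

Calling this once per layer with a fixed `r` would shrink the token count gradually, which is the sense in which the abstract describes the fusion as gradual. Tracking `sizes` lets each fused token remain a weighted average of the original tokens it absorbed.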
Benchmarks
| Benchmark | Methodology | GFLOPs | Top-1 Accuracy (%) |
|---|---|---|---|
| efficient-vits-on-imagenet-1k-with-deit-s | MCTF ($r=18$) | 2.4 | 79.9 |
| efficient-vits-on-imagenet-1k-with-deit-s | MCTF ($r=20$) | 2.2 | 79.5 |
| efficient-vits-on-imagenet-1k-with-deit-s | MCTF ($r=16$) | 2.6 | 80.1 |
| efficient-vits-on-imagenet-1k-with-deit-t | MCTF ($r=20$) | 0.6 | 71.4 |
| efficient-vits-on-imagenet-1k-with-deit-t | MCTF ($r=8$) | 1.0 | 72.9 |
| efficient-vits-on-imagenet-1k-with-deit-t | MCTF ($r=16$) | 0.7 | 72.7 |
| efficient-vits-on-imagenet-1k-with-lv-vit-s | MCTF ($r=16$) | 3.6 | 82.3 |
| efficient-vits-on-imagenet-1k-with-lv-vit-s | MCTF ($r=8$) | 4.9 | 83.5 |
| efficient-vits-on-imagenet-1k-with-lv-vit-s | MCTF ($r=12$) | 4.2 | 83.4 |