Command Palette
Search for a command to run...
Zixuan Su Hao Zhang Jingjing Chen Lei Pang Chong-Wah Ngo Yu-Gang Jiang

Abstract
Neural networks for visual content understanding have recently evolved from convolutional ones (CNNs) to transformers. The prior (CNN) relies on small-windowed kernels to capture the regional clues, demonstrating solid local expressiveness. On the contrary, the latter (transformer) establishes long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is a growing interest in designing hybrid models to best utilize each technique. Current hybrids merely replace convolutions as simple approximations of linear projection or juxtapose a convolution branch with attention, without concerning the importance of local/global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights. Specifically, an ASF-former encoder equally splits feature channels into half to fit dual-path inputs. Then, the outputs of dual-path are fused with weighting scalars calculated from visual cues. We also design the convolutional path compactly for efficiency concerns. Extensive experiments on standard benchmarks, such as ImageNet-1K, CIFAR-10, and CIFAR-100, show that our ASF-former outperforms its CNN, transformer counterparts, and hybrid pilots in terms of accuracy (83.9% on ImageNet-1K), under similar conditions (12.9G MACs/56.7M Params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former.
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-classification-on-cifar-10 | ASF-former-S | Percentage correct: 98.7 |
| image-classification-on-cifar-10 | ASF-former-B | Percentage correct: 98.8% |
| image-classification-on-cifar-10-image | ASF-former-B | Params: 56.7M |
| image-classification-on-cifar-10-image | ASF-former-S | Params: 19.3M |
| image-classification-on-imagenet | ASF-former-B | Number of params: 56.7M Top 1 Accuracy: 83.9% |
| image-classification-on-imagenet | ASF-former-S | Number of params: 19.3M Top 1 Accuracy: 82.7% |
Build AI with AI
From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.