Swetha Mandava, Szymon Migacz, Alex Fit-Florea

Abstract
Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explore trade-offs in the number and ordering of these blocks to improve upon the current Transformer architecture, and propose the PAR Transformer. It needs 35% lower compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining perplexity on the WikiText-103 language modelling benchmark. We further validate our results on the text8 and enwiki8 datasets, as well as on the BERT model.
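As a rough illustration of this idea (not the paper's exact architecture or its searched block ordering), the sketch below builds a Transformer stack from a pattern string in which most self-attention blocks are replaced by cheaper feed-forward blocks. It assumes a PyTorch implementation; the class names and the pattern `"sffsffsffsff"` are hypothetical, and causal masking and relative positional encodings needed for language modelling are omitted for brevity.

```python
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Position-wise feed-forward block (captures content meaning)."""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pre-norm residual connection.
        return x + self.ff(self.norm(x))


class SelfAttentionBlock(nn.Module):
    """Multi-head self-attention block (captures context meaning)."""
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class PARStyleTransformer(nn.Module):
    """Stack built from a pattern string, e.g. 'sffsff...', where
    's' = self-attention and 'f' = feed-forward. Replacing most 's'
    blocks with 'f' blocks reduces compute; the pattern used here is
    illustrative, not the architecture found by the paper's search."""
    def __init__(self, pattern, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.blocks = nn.ModuleList(
            SelfAttentionBlock(d_model, n_heads) if c == "s"
            else FeedForwardBlock(d_model, d_ff)
            for c in pattern
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# Example: 12 blocks, of which only 1/3 are self-attention.
model = PARStyleTransformer("sffsffsffsff")
x = torch.randn(2, 128, 512)  # (batch, sequence, d_model)
print(model(x).shape)         # torch.Size([2, 128, 512])
```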
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Language Modelling on enwiki8 | PAR Transformer 24B | Bits per Character (BPC): 1.11 |
| Language Modelling on text8 | PAR Transformer 24B | Bits per Character (BPC): 1.18 |
| Language Modelling on WikiText-103 | PAR Transformer Base | Test perplexity: 22.7 |
| Language Modelling on WikiText-103 | PAR Transformer Large | Test perplexity: 18.4 |
| Sentiment Analysis on SST-2 (Binary) | PAR BERT Base | Accuracy: 91.6 |