Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai; Zhilin Yang; Yiming Yang; Jaime Carbonell; Quoc V. Le; Ruslan Salakhutdinov

Abstract

Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependencies beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependencies, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependencies that are 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
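The two ideas named in the abstract, segment-level recurrence and relative positional encoding, amount to caching the hidden states of the previous segment and letting the current segment attend to them without back-propagating through the cache. The snippet below is a minimal PyTorch sketch of that caching step only; the class name ToySegmentAttention and its single-head layout are illustrative assumptions rather than code from the official repository, and the relative positional encoding and causal masking are omitted for brevity.

```python
from typing import Optional

import torch
import torch.nn as nn


class ToySegmentAttention(nn.Module):
    """Single-head attention with a segment-level memory (illustrative only).

    Sketches the Transformer-XL recurrence: hidden states cached from the
    previous segment are prepended to the keys/values of the current segment,
    and no gradient flows through the cache.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor]):
        # x:      (batch, seg_len, d_model) -- current segment
        # memory: (batch, mem_len, d_model) -- cached states from the previous segment
        if memory is None:
            context = x
        else:
            context = torch.cat([memory.detach(), x], dim=1)  # stop-gradient on the cache

        q = self.q_proj(x)        # queries come from the current segment only
        k = self.k_proj(context)  # keys/values also cover the cached segment
        v = self.v_proj(context)

        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v

        # The current segment's states become the memory for the next segment
        # (a real implementation would truncate this to a fixed mem_len).
        return out, x.detach()


layer = ToySegmentAttention(d_model=32)
seg1, seg2 = torch.randn(2, 16, 32), torch.randn(2, 16, 32)
out1, mem = layer(seg1, memory=None)
out2, mem = layer(seg2, memory=mem)  # seg2 attends over seg1's cached states as well
```

In the full model this memory is threaded through every layer, which is what lets the effective context grow well beyond the segment length at evaluation time.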

Code Repositories

sooftware/attentions (PyTorch)
SambhawDrag/XLNet.jl (PyTorch)
TimDettmers/transformer-xl (PyTorch)
mustafaaljadery/gemma-2b-10m (PyTorch)
wxt1997/Transformer-Transducer (PyTorch)
okkteam/Transformer-Transducer (PyTorch)
cmunnis/BERT_vs_Transformer-XL (PyTorch)
Jmkernes/PAR-Transformer-XL (TensorFlow)
kimiyoung/transformer-xl (official, PyTorch)
aiha-lab/Attention-Head-Pruning (PyTorch)
zhdbwe/Paper-DailyReading (TensorFlow)
sh951011/Attention-Implementation (PyTorch)
listenviolet/XLNet (PyTorch)
google-research/meliad (JAX)
huggingface/transformers (PyTorch; see the usage sketch after this list)
sooftware/conformer (PyTorch)
inzva/fake-academic-paper-generation (PyTorch)
samwisegamjeee/pytorch-transformers (PyTorch)
cedrickchee/pytorch-pretrained-BERT (PyTorch)
sooftware/nlp-attentions (PyTorch)
park-cheol/ASR-Conformer (PyTorch)
sooftware/Attention-Implementation (PyTorch)
huggingface/xlnet (TensorFlow)
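Of the repositories above, huggingface/transformers distributes a pretrained WikiText-103 checkpoint under the name transfo-xl-wt103. The snippet below is a usage sketch assuming an older transformers release that still ships the TransfoXL classes (they were later deprecated and removed from the library); it is not the official kimiyoung/transformer-xl workflow.

```python
# Usage sketch: text generation with the pretrained WikiText-103 checkpoint.
# Assumes an older transformers release that still includes the TransfoXL
# classes (plus the sacremoses dependency used by its word-level tokenizer).
import torch
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
model.eval()

prompt = "The history of language modeling"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Greedy continuation; the model reuses its cached segment-level memory
    # internally, which is what makes long-range generation cheap at eval time.
    output_ids = model.generate(input_ids, max_length=60)

print(tokenizer.decode(output_ids[0]))
```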

Benchmarks

Benchmark | Methodology | Metrics
language-modelling-on-enwiki8 | Transformer-XL (12 layers) | BPC: 1.06; Params: 41M
language-modelling-on-enwiki8 | Transformer-XL (24 layers) | BPC: 0.99; Params: 277M
language-modelling-on-enwiki8 | Transformer-XL (18 layers) | BPC: 1.03; Params: 88M
language-modelling-on-hutter-prize | 18-layer Transformer-XL | BPC: 1.03; Params: 88M
language-modelling-on-hutter-prize | 12-layer Transformer-XL | BPC: 1.06; Params: 41M
language-modelling-on-hutter-prize | 24-layer Transformer-XL | BPC: 0.99; Params: 277M
language-modelling-on-one-billion-word | Transformer-XL Large | Params: 0.8B; PPL: 21.8
language-modelling-on-one-billion-word | Transformer-XL Base | Params: 0.46B; PPL: 23.5
language-modelling-on-penn-treebank-word | Transformer-XL | Params: 24M; Test perplexity: 54.55; Validation perplexity: 56.72
language-modelling-on-text8 | Transformer-XL (24 layers) | BPC: 1.08; Params: 277M
language-modelling-on-wikitext-103 | Transformer-XL Large | Params: 257M; Test perplexity: 18.3; Validation perplexity: 18.2
language-modelling-on-wikitext-103 | Transformer-XL Standard | Params: 151M; Test perplexity: 24.0; Validation perplexity: 23.1
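Note that the table mixes two metrics: bits per character (enwik8, text8, Hutter Prize) and word-level perplexity (WikiText-103, One Billion Word, Penn Treebank). Both are monotone transforms of the average cross-entropy, with perplexity equal to 2 raised to the bits per token, so character-level and word-level numbers are not directly comparable. A small sketch of the arithmetic, using numbers from the table:

```python
import math

def bpc_to_char_perplexity(bpc: float) -> float:
    """Per-character perplexity implied by a bits-per-character score."""
    return 2.0 ** bpc

def perplexity_to_bits(ppl: float) -> float:
    """Bits per token implied by a perplexity score."""
    return math.log2(ppl)

print(bpc_to_char_perplexity(0.99))  # ~1.99 per character (enwik8, 24 layers)
print(perplexity_to_bits(18.3))      # ~4.19 bits per word (WikiText-103 Large, test)
```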
