Ben Krause; Emmanuel Kahembwe; Iain Murray; Steve Renals

Abstract
This research note combines two methods that have recently improved the state of the art in language modeling: Transformers and dynamic evaluation. Transformers use stacked layers of self-attention that allow them to capture long range dependencies in sequential data. Dynamic evaluation fits models to the recent sequence history, allowing them to assign higher probabilities to re-occurring sequential patterns. By applying dynamic evaluation to Transformer-XL models, we improve the state of the art on enwik8 from 0.99 to 0.94 bits/char, text8 from 1.08 to 1.04 bits/char, and WikiText-103 from 18.3 to 16.4 perplexity points.
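The note does not spell dynamic evaluation out on this page, so the following is a minimal sketch, not the authors' implementation: after a pretrained model scores a segment of the evaluation stream, its parameters are updated on that segment with a gradient step, so information from the recent history carries into the predictions for the next segment. The `model` interface, segment length, and learning rate below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dynamic_eval_sgd(model, token_stream, seg_len=128, lr=1e-4, device="cpu"):
    """Score a token stream with plain SGD dynamic evaluation.

    The model is updated on each segment *after* that segment has been
    scored, so every prediction is made before its targets are seen.
    `model` is assumed to map a (1, T) tensor of token ids to (1, T, vocab)
    logits; all hyperparameter values here are illustrative only.
    """
    model = model.to(device).eval()           # no dropout, but gradients still flow
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(token_stream) - 1, seg_len):
        seg = torch.tensor(token_stream[start:start + seg_len + 1],
                           device=device).unsqueeze(0)
        inputs, targets = seg[:, :-1], seg[:, 1:]

        logits = model(inputs)                # (1, T, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

        # Accumulate the *pre-update* loss: this is the reported score.
        total_nll += loss.item() * targets.numel()
        total_tokens += targets.numel()

        # Then adapt the parameters to the segment just scored.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return total_nll / total_tokens           # average nats per token
```

Dividing the returned value by ln 2 gives bits per character; in the paper's setting the updates are applied to a pretrained Transformer-XL rather than the generic model assumed here.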
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| language-modelling-on-enwiki8 | Transformer-XL (24 layers) + RMS dynamic eval + decay | BPC: 0.940; Params: 277M |
| language-modelling-on-hutter-prize | Transformer-XL + RMS dynamic eval | BPC: 0.94; Params: 277M |
| language-modelling-on-text8 | Transformer-XL + RMS dynamic eval + decay | BPC: 1.038; Params: 277M |
| language-modelling-on-wikitext-103 | Transformer-XL + RMS dynamic eval | Test perplexity: 16.4; Validation perplexity: 15.8; Params: 257M |
| language-modelling-on-wikitext-103 | Transformer-XL + SGD dynamic eval | Test perplexity: 17.0; Validation perplexity: 16.3; Params: 257M |
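The "RMS dynamic eval" and "SGD dynamic eval" rows above differ only in the update rule applied during evaluation. As a rough sketch (not the authors' released code), RMS dynamic evaluation scales each gradient step per parameter by RMS gradient statistics and, in the "decay" variant, pulls the adapted weights back toward the pretrained weights after each step. The class name, attribute names, and hyperparameters below are illustrative assumptions.

```python
import torch

class RMSDynamicEval:
    """Sketch of an RMS-normalized dynamic-evaluation update with decay.

    Gradient steps are scaled per parameter by precomputed RMS gradient
    statistics, and a decay term pulls the adapted weights back toward the
    original pretrained weights. `grad_rms` would normally be estimated from
    gradients on training data; here it is simply passed in.
    """

    def __init__(self, params, grad_rms, lr=1e-4, decay=1e-2, eps=1e-8):
        self.params = list(params)
        self.orig = [p.detach().clone() for p in self.params]   # pretrained weights
        self.grad_rms = grad_rms                                 # same shapes as params
        self.lr, self.decay, self.eps = lr, decay, eps

    @torch.no_grad()
    def step(self):
        for p, p0, rms in zip(self.params, self.orig, self.grad_rms):
            if p.grad is None:
                continue
            # RMS-normalized gradient step on the current segment's loss.
            p -= self.lr * p.grad / (rms + self.eps)
            # Decay the adapted weights back toward the pretrained weights.
            p += self.decay * (p0 - p)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```

Swapping this update in for the SGD optimizer in the earlier sketch corresponds to the RMS rows; omitting the normalization and decay recovers plain SGD dynamic evaluation.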