
Sequence Generation with Mixed Representations

Lijun Wu, Shufang Xie, Yingce Xia, Fan Yang, Tao Qin, Jianhuang Lai, Tie-Yan Liu

Abstract

Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role in neural NLP models. Tokenization methods such as byte-pair encoding (BPE), which can greatly reduce vocabulary size and handle out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no consensus on which one is best. In this work, we propose to leverage the mixed representations from different tokenization methods for sequence generation tasks, in order to boost model performance by exploiting the unique characteristics and advantages of each tokenization method. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation (NMT) tasks with six language pairs (e.g., English↔German, English↔Romanian), as well as an abstractive summarization task.
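The premise is that different tokenizers segment the same text into different token sequences, and this diversity is what the mixed-representation model and the co-teaching algorithm exploit. The sketch below is a minimal illustration of that diversity, not the authors' implementation: it contrasts word-level, character-level, and BPE segmentations of one sentence, using a hypothetical BPE merge table made up for demonstration.

```python
# A minimal sketch (not the authors' code) showing how different
# tokenization methods segment the same sentence differently: the
# diversity that mixed representations aim to exploit. The BPE merge
# table below is hypothetical, chosen only for illustration.

sentence = "the lowest temperature"

# Word-level tokenization: split on whitespace.
word_tokens = sentence.split()

# Character-level tokenization ("_" marks the word boundary,
# a common convention).
char_tokens = list(sentence.replace(" ", "_"))

# Toy BPE: greedily apply a small merge table, word by word.
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]

def bpe_tokenize(word, merges):
    """Apply BPE merge rules in order, starting from characters."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the pair
            else:
                i += 1
    return symbols

bpe_tokens = [t for w in sentence.split() for t in bpe_tokenize(w, merges)]

print(word_tokens)  # ['the', 'lowest', 'temperature']
print(char_tokens)  # ['t', 'h', 'e', '_', 'l', 'o', 'w', ...]
print(bpe_tokens)   # ['t', 'h', 'e', 'low', 'est', 't', 'e', 'm', ...]
```

Each tokenizer yields a distinct view of the same sentence; in the paper's setting, a model that consumes and is co-trained on such mixed views can benefit from the strengths of each.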

Benchmarks

Benchmark | Methodology | Metrics
machine-translation-on-iwslt2014-english | MixedRepresentations | BLEU score: 29.93
machine-translation-on-iwslt2014-german | MixedRepresentations | BLEU score: 36.41
