Ben Peters, André F. T. Martins

Abstract
This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation. We make three submissions, all to the word-level subtask. First, we show that entmax-based sparse sequence-to-sequence models deliver large improvements over conventional softmax-based models, echoing results from other tasks. Then, we challenge the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords. This subword transformer outperforms all of our character-level models and wins the word-level subtask. Although we do not make an official submission to the sentence-level subtask, we show that this subword-based approach is highly effective there as well.
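As a rough illustration of the first idea, the sketch below (not the authors' released code) swaps softmax for 1.5-entmax in scaled dot-product attention, using DeepSPIN's `entmax` package; the tensor shapes and the helper's name are assumptions for illustration.

```python
# Minimal sketch: scaled dot-product attention with a sparse 1.5-entmax
# normalizer in place of softmax. Uses DeepSPIN's `entmax` package
# (pip install entmax); shapes and the function name are illustrative.
import torch
from entmax import entmax15

def entmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    # Unlike softmax, entmax15 can assign exactly zero weight to
    # irrelevant positions, producing sparse attention distributions.
    probs = entmax15(scores, dim=-1)
    return probs @ v

# Toy usage
q = k = v = torch.randn(1, 2, 5, 8)
print(entmax_attention(q, k, v).shape)  # torch.Size([1, 2, 5, 8])
```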
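The subword vocabulary behind the winning system is induced with a unigram language model; a common way to do this (assumed here, not confirmed as the authors' exact pipeline) is SentencePiece in unigram mode. The filenames and vocabulary size below are placeholders, not values from the paper.

```python
# Hypothetical sketch: inducing a unigram-LM subword vocabulary with
# SentencePiece (Kudo, 2018). Filenames and vocab_size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_words.txt",   # placeholder: one training word per line
    model_prefix="ulm",
    vocab_size=1000,           # placeholder value
    model_type="unigram",      # unigram language model segmentation
)

sp = spm.SentencePieceProcessor(model_file="ulm.model")
# Each word is then modeled as a sequence of ULM subwords, not characters.
print(sp.encode("unbelievable", out_type=str))
```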
Benchmarks
| Benchmark | Methodology | Macro avg. F1 (subtask 1) |
|---|---|---|
| Morpheme Segmentation on UniMorph 4.0 | Char LSTM (DeepSPIN-1; soft attention) | 96.32 |
| Morpheme Segmentation on UniMorph 4.0 | Char LSTM (DeepSPIN-2; soft attention, 1.5-entmax) | 97.15 |
| Morpheme Segmentation on UniMorph 4.0 | Subword-ULM transformer (DeepSPIN-3; soft attention, 1.5-entmax) | 97.29 |