Multi-components System for Automatic Arabic Diacritization

Shengwu Xiong, Hamza Abbad

Abstract

In this paper, we propose an approach to the automatic restoration of Arabic diacritics that stacks three components in a pipeline: a deep learning model, which is a multi-layer recurrent neural network with LSTM and Dense layers; a character-level rule-based corrector, which applies deterministic operations to prevent some errors; and a word-level statistical corrector, which uses context and distance information to fix some diacritization issues. The approach is novel in that it combines methods of different types and adds edit-distance-based corrections. We used a large public dataset of raw diacritized Arabic text (Tashkeela) to train and test our system after cleaning and normalizing it. On a newly released benchmark test set, our system outperformed all the tested systems, achieving a diacritic error rate (DER) of 3.39% and a word error rate (WER) of 9.94% when taking all Arabic letters into account, and a DER of 2.61% and a WER of 5.83% when ignoring the diacritization of the last letter of every word.
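The two metrics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation script: the function names (`split_marks`, `error_rates`), the choice of the Unicode range U+064B to U+0652 as the diacritic set, the range U+0621 to U+064A as "Arabic letters", and the decision to skip words that have no counted letters are all assumptions made for illustration. DER is taken here as the fraction of counted letters whose diacritics differ from the reference, and WER as the fraction of words containing at least one such letter.

```python
# Assumed diacritic set: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_marks(word):
    """Split a word into (base character, sorted diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1].append(ch)
        else:
            pairs.append([ch, []])
    # Sorting makes shadda+vowel and vowel+shadda compare as equal.
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def is_arabic_letter(ch):
    # Simplifying assumption: the basic Arabic letter block only.
    return "\u0621" <= ch <= "\u064A"


def error_rates(reference, hypothesis, ignore_last=False):
    """Return (DER, WER) as fractions for two whitespace-tokenized texts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("texts must contain the same number of words")
    letters = letter_errors = words = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref, hyp = split_marks(ref_word), split_marks(hyp_word)
        if [b for b, _ in ref] != [b for b, _ in hyp]:
            raise ValueError("texts must share the same undiacritized letters")
        if ignore_last:
            # Drop the last character of every word from the comparison.
            ref, hyp = ref[:-1], hyp[:-1]
        counted = wrong = 0
        for (base, ref_marks), (_, hyp_marks) in zip(ref, hyp):
            if is_arabic_letter(base):
                counted += 1
                wrong += ref_marks != hyp_marks
        if counted == 0:
            continue  # assumption: words with no counted letters are skipped
        letters += counted
        letter_errors += wrong
        words += 1
        word_errors += wrong > 0
    if letters == 0:
        raise ValueError("no Arabic letters to evaluate")
    return letter_errors / letters, word_errors / words


# Reference "kataba al-waladu" vs. a hypothesis with a wrong final vowel.
ref = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
hyp = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
print(error_rates(ref, hyp))                    # (0.125, 0.5)
print(error_rates(ref, hyp, ignore_last=True))  # (0.0, 0.0)
```

In this toy example the only mistake is on a word-final letter, so both rates fall to zero once last letters are ignored. This mirrors why the abstract reports lower figures in that setting: word-final diacritics are where part of the errors sit.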

Benchmarks

Benchmark: arabic-text-diacritization-on-tashkeela-1
Methodology: MC
Metrics:
Diacritic Error Rate: 0.0339
Word Error Rate (WER): 0.0994
