Louis Martin Benjamin Muller Pedro Javier Ortiz Suárez Yoann Dupont Laurent Romary Éric Villemonte de la Clergerie Djamé Seddah Benoît Sagot

Abstract
Pretrained language models are now ubiquitous in natural language processing. Despite their success, most available models have been trained either on English data only or on a concatenation of data in multiple languages, which severely limits their practical use in languages other than English. This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example, and evaluates the resulting models on four downstream tasks: part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. The results show that web-crawled data is preferable to Wikipedia data. More surprisingly, a relatively small web-crawled dataset (4GB) yields results as good as, or better than, those obtained with much larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art on all four tasks, demonstrating strong language-modeling capability.
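As a quick illustration of how the released model can be queried, below is a minimal sketch of masked-token prediction with CamemBERT, assuming the Hugging Face transformers library (listed among the official repositories below) and the publicly released camembert-base checkpoint; the example sentence is illustrative only.

```python
# Minimal sketch: masked-token prediction with CamemBERT via the
# Hugging Face transformers library. Assumes the public "camembert-base"
# checkpoint; output fields may vary slightly across library versions.
from transformers import pipeline

# CamemBERT uses "<mask>" as its mask token (RoBERTa-style).
fill_mask = pipeline("fill-mask", model="camembert-base")

for prediction in fill_mask("Le camembert est un fromage <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```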
Code Repositories
Karthik-Bhaskar/Context-Based-Question-Answering
tf
Mentioned in GitHub
hbaflast/bert-sentiment-analysis-pytorch
pytorch
Mentioned in GitHub
anaishoareau/french_preprocessing
pytorch
Mentioned in GitHub
huggingface/transformers
Official
pytorch
Mentioned in GitHub
pwc-1/Paper-8/tree/main/camembert
mindspore
bourrel/French-News-Clustering
tf
Mentioned in GitHub
hbaflast/bert-sentiment-analysis-tensorflow
tf
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| dependency-parsing-on-french-gsd | CamemBERT | LAS: 92.47, UAS: 94.82 |
| dependency-parsing-on-partut | CamemBERT | LAS: 92.9, UAS: 95.21 |
| dependency-parsing-on-sequoia-treebank | CamemBERT | LAS: 94.39, UAS: 95.56 |
| dependency-parsing-on-spoken-corpus | CamemBERT | LAS: 81.37, UAS: 86.05 |
| named-entity-recognition-on-french-treebank | CamemBERT (subword masking) | F1: 87.93, Precision: 88.35, Recall: 87.46 |
| natural-language-inference-on-xnli-french | CamemBERT (large) | Accuracy: 85.7  | 
| natural-language-inference-on-xnli-french | CamemBERT (base) | Accuracy: 81.2  | 
| part-of-speech-tagging-on-french-gsd | CamemBERT | UPOS: 98.19  | 
| part-of-speech-tagging-on-partut | CamemBERT | UPOS: 97.63  | 
| part-of-speech-tagging-on-sequoia-treebank | CamemBERT | UPOS: 99.21  | 
| part-of-speech-tagging-on-spoken-corpus | CamemBERT | UPOS: 96.68  |
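The XNLI rows above correspond to sentence-pair classification on French XNLI. Below is a hedged sketch of what such a fine-tuning setup might look like with the transformers library; the hyperparameters, example sentences, and label mapping are illustrative assumptions, not the paper's training configuration.

```python
# Hedged sketch: fine-tuning CamemBERT for French NLI (sentence-pair
# classification). All hyperparameters and data here are illustrative.
import torch
from transformers import CamembertTokenizer, CamembertForSequenceClassification

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=3  # assumed mapping: entailment/neutral/contradiction
)

premise = "Le chat dort sur le canapé."
hypothesis = "Un animal se repose."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

# One illustrative training step; a real run would iterate over XNLI batches.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
labels = torch.tensor([0])  # hypothetical gold label: entailment
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```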