HyperAI


Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

Xiang Dai; Sarvnaz Karimi; Ben Hachey; Cecile Paris

Abstract

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
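The abstract mentions using similarity measures to nominate in-domain pretraining data. As a rough illustration of that idea (not the paper's exact measure), one can compare the unigram word distribution of a target task corpus against candidate pretraining corpora using Jensen-Shannon divergence and pick the closest candidate; the corpus names and text snippets below are purely illustrative.

```python
from collections import Counter
import math

def word_dist(texts):
    """Unigram distribution over whitespace tokens of a corpus sample."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    0.0 means identical distributions; 1.0 means disjoint vocabularies.
    """
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    def kl(a):
        return sum(prob * math.log2(prob / m[w]) for w, prob in a.items() if prob > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical corpora: rank candidate pretraining sources by closeness
# to the target task's text (lower divergence = more in-domain).
target = word_dist(["pt c/o chest pain", "pt denies sob"])
candidates = {
    "tweets": word_dist(["omg chest pain again smh", "cant sleep lol"]),
    "news":   word_dist(["the market rallied on tuesday", "officials announced a plan"]),
}
ranked = sorted(candidates, key=lambda name: js_divergence(target, candidates[name]))
print(ranked)  # most similar candidate corpus first
```

In practice such a measure would be computed on large samples of each candidate corpus; the sketch only shows the ranking mechanics.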

Benchmarks

Benchmark: clinical-concept-extraction-on-2010-i2b2va
Methodology: ClinicalBERT
Metrics: Exact Span F1: 87.4
