TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts

Sajad Sotudeh; Hanieh Deilamsalehy; Franck Dernoncourt; Nazli Goharian

Abstract

Recent summarization models consist of millions of parameters, and their performance depends heavily on the abundance of training data. While most existing summarization corpora contain on the order of thousands to one million instances, the construction of large-scale summarization datasets on the order of a couple of million instances has yet to be explored. In practice, more training data helps a model generalize the patterns it learns to unseen data. In this paper, we introduce TLDR9+, a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum (https://github.com/sajastu/reddit_collector). This dataset is specifically gathered to perform extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. We go one step further and, with the help of human annotations, distill a more fine-grained dataset by sampling high-quality instances from TLDR9+, which we call the TLDRHQ dataset. We further benchmark different state-of-the-art summarization models on our proposed datasets.

Benchmarks

| Benchmark | Methodology | RG-1 (%) | RG-2 (%) | RG-L (%) |
|---|---|---|---|---|
| extreme-summarization-on-tldr9 | ORACLE-EXT | 30.26 | 9.74 | 20.60 |
| extreme-summarization-on-tldr9 | BART | 23.59 | 9.69 | 18.62 |
| extreme-summarization-on-tldr9 | BERTSUMEXT | 20.94 | 4.98 | 14.48 |
| extreme-summarization-on-tldr9 | BERTSUMABS | 23.05 | 9.48 | 18.07 |
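
The RG-1, RG-2, and RG-L metrics above are ROUGE scores: unigram overlap, bigram overlap, and longest-common-subsequence overlap between a generated summary and its reference. The sketch below illustrates how such scores can be computed. It is a simplified illustration, not the evaluation setup behind the reported numbers: it assumes whitespace tokenization, lowercasing, no stemming, and F1 variants of each score, none of which is stated on this page. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 using clipped n-gram counts (assumed tokenization: split on whitespace)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    summary = "the cat sat"
    reference = "the cat sat on the mat"
    print(round(100 * rouge_n_f1(summary, reference, n=1), 2))
    print(round(100 * rouge_n_f1(summary, reference, n=2), 2))
    print(round(100 * rouge_l_f1(summary, reference), 2))
```

Published ROUGE numbers usually come from a standard toolkit with its own tokenization and stemming rules, so scores from this sketch should not be expected to match the table exactly.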
