Firefly Chinese Llama2 Incremental Pre-training Dataset
Date: 2 years ago
Size: 9.02 GB
This dataset is the incremental pre-training data for the Firefly-LLaMA2-Chinese project. It totals about 22 GB of text, drawn mainly from open-source datasets such as CLUE, ThucNews, CNews, COIG, and Wikipedia, together with ancient poems, prose, and classical Chinese texts collected by the research team. The data distribution is shown in the figure below.

firefly-pretrain-dataset.torrent
Seeding: 2 | Downloading: 0 | Completed: 116 | Total Downloads: 143