Firefly Chinese Llama2 Incremental Pre-training Dataset

Date

2 years ago

Size

9.02 GB

Publish URL

huggingface.co

This dataset is the incremental pre-training data from the Firefly-LLaMA2-Chinese project. It totals about 22 GB of text, drawn mainly from open-source datasets such as CLUE, ThucNews, CNews, COIG, and Wikipedia, together with ancient poems, prose, and classical Chinese texts collected by the research team. The data distribution is shown in the figure below.

firefly-pretrain-dataset.torrent
Seeding: 2, Downloading: 0, Completed: 116, Total downloads: 143
  • firefly-pretrain-dataset/
    • README.md (1.04 KB)
    • README.txt (2.09 KB)
    • data/
      • firefly-pretrain-dataset.zip (9.02 GB)
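
Before extracting an archive of this size, it can help to list what it contains. The following is a minimal sketch using only Python's standard `zipfile` module. The helper name `summarize_archive` and the local path are assumptions for illustration, and nothing is assumed about the format of the files inside the archive. The description cites about 22 GB of text while the download is 9.02 GB, which presumably reflects compression; the uncompressed total printed here is one way to check that.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path):
    """Return (member_name, uncompressed_bytes) for every file in the archive.

    Hypothetical helper for illustration; it reads only the archive's index
    and does not extract anything.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        ]


if __name__ == "__main__":
    # Assumed local path, mirroring the torrent's file tree.
    path = Path("firefly-pretrain-dataset/data/firefly-pretrain-dataset.zip")
    if path.exists():
        members = summarize_archive(path)
        for name, size in members:
            print(f"{size / 1e6:12.1f} MB  {name}")
        total_gb = sum(size for _, size in members) / 1e9
        print(f"{len(members)} files, {total_gb:.2f} GB uncompressed")
    else:
        print(f"Archive not found at {path}")
```

Reading only the index keeps the check fast and avoids writing any of the uncompressed data to disk until the contents are known.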