HyperAI

Nemotron-Pretraining-Dataset-sample Sampling Dataset

Date: a month ago
Size: 79.87 MB
Organization: NVIDIA
Publish URL: huggingface.co
Paper URL: 2508.14444
License: Other


Nemotron-Pretraining-Dataset-sample is a compact sample of the Nemotron pretraining dataset released by NVIDIA in 2025. The accompanying paper is "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model".

The dataset contains 10 representative subsets drawn from different components of the complete SFT and pretraining corpus. They cover high-quality question-answering data, extracted math-focused content, code metadata, and SFT-style instruction data, which makes the sample suitable for inspection and quick experiments.

Nemotron-Pretraining-Dataset-sample.torrent
Seeding: 1, Downloading: 0, Completed: 11, Total Downloads: 46
  • Nemotron-Pretraining-Dataset-sample/
    • README.md
      1.37 KB
    • README.txt
      2.73 KB
    • data/
      • Nemotron-Pretraining-Dataset-sample.zip
        79.87 MB
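
For a quick look at what the downloaded archive contains, a minimal sketch along the following lines may help. It uses only Python's standard `zipfile` module. The local path is an assumption that mirrors the torrent listing above, and `summarize_archive` is a hypothetical helper name. The internal layout of the zip is not described on this page, so the sketch only lists member names and sizes without assuming any file format.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path):
    """Return (member name, uncompressed size in bytes) for each file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries, keep files only
        ]


if __name__ == "__main__":
    # Assumed local path, mirroring the torrent listing; adjust to your download location.
    path = Path(
        "Nemotron-Pretraining-Dataset-sample/data/"
        "Nemotron-Pretraining-Dataset-sample.zip"
    )
    if path.exists():
        for name, size in summarize_archive(path):
            print(f"{size:>12,d}  {name}")
    else:
        print(f"Archive not found at {path}")
```

Listing the members first, before extracting anything, is a cheap way to see how the subsets are organized and which file formats are involved.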