ChemData Chemical Task Dataset

Date

a year ago

Size

242.89 MB

Organization

Paper URL

arxiv.org

Dataset Introduction

The Shanghai Artificial Intelligence Laboratory open-sourced this dataset in 2024 together with ChemLLM (the Pu Ke chemistry model), its first large model for science. The accompanying paper is "ChemLLM: A Chemical Large Language Model".

The release centers on ChemData700K. The research team also open-sourced the Chinese and English versions of ChemBench4K, ChemPref-10K, and the C-MHChem dataset.

ChemData700K dataset

ChemData700K is an instruction fine-tuning dataset for building chemistry capability in large language models. It covers 9 core chemistry tasks with 730K high-quality question-answer pairs, roughly a one-tenth sample drawn from a pool of 7 million entries. The dataset spans a wide range of chemical domain knowledge and is organized into 3 main task categories: molecules, reactions, and domains.
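The sampling ratio described above can be checked with a few lines of Python. This is a minimal sketch only: the description does not say how the sample was drawn, so uniform random sampling, the `sample_fraction` helper, and the stand-in `pool` records are all assumptions made for illustration.

```python
import random

def sample_fraction(pool, fraction, seed=0):
    """Return a uniform random sample holding about `fraction` of `pool`."""
    k = round(len(pool) * fraction)
    return random.Random(seed).sample(pool, k)

# Ratio implied by the figures quoted in the description.
ratio = 730_000 / 7_000_000
print(f"{ratio:.3f}")  # 0.104

# Hypothetical stand-in pool; field names are illustrative, not the real schema.
categories = ("molecules", "reactions", "domains")
pool = [{"id": i, "category": categories[i % 3]} for i in range(7_000)]

subset = sample_fraction(pool, 0.1)
print(len(subset))  # 700
```

A fixed seed keeps the draw reproducible, which matters when a sampled subset is later compared against a benchmark built from the same tasks.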

ChemBench4K benchmark dataset

ChemBench is a benchmark consisting of 9 tasks on chemical molecules and reactions, the same 9 tasks used in ChemData. It provides a basis for objectively measuring the chemistry proficiency of LLMs. ChemBench contains 4,100 multiple-choice questions, each with exactly one correct answer.
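Because every question has a single correct option, scoring reduces to exact-match accuracy. The sketch below shows one way to compute overall and per-task accuracy; the `task` and `answer` field names, the task labels, and the `accuracy_by_task` helper are hypothetical and do not reflect the benchmark's actual file schema.

```python
from collections import defaultdict

def accuracy_by_task(items, predictions):
    """Score single-answer multiple-choice items, overall and per task."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["task"]] += 1
        # Normalize the predicted option letter before comparing.
        if pred.strip().upper() == item["answer"]:
            correct[item["task"]] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_task

# Hypothetical items; task names and fields are illustrative only.
items = [
    {"task": "task_a", "answer": "B"},
    {"task": "task_a", "answer": "D"},
    {"task": "task_b", "answer": "A"},
    {"task": "task_b", "answer": "C"},
]
predictions = ["B", "a", "A", " c "]

overall, per_task = accuracy_by_task(items, predictions)
print(overall)   # 0.75
print(per_task)  # {'task_a': 0.5, 'task_b': 1.0}
```

Reporting per-task accuracy alongside the overall figure makes it easier to see which of the 9 tasks a model struggles with.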

ChemPref-10K dataset

This dataset can be used to optimize language models to match human preferences and is available in both English and Chinese versions.
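As an illustration of how preference data is typically consumed, the sketch below computes the Direct Preference Optimization (DPO) loss for one preference pair. The description does not name a training method, so DPO is an assumption here, as are the `prompt`/`chosen`/`rejected` record layout, the example log-probabilities, and the `dpo_loss` helper.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each response."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical preference record; the real field names may differ.
pair = {
    "prompt": "Which element has atomic number 6?",
    "chosen": "Carbon.",
    "rejected": "Oxygen.",
}

# Made-up log-probabilities: the policy favors the chosen answer more than
# the reference model does, so the loss falls below log(2).
loss = dpo_loss(policy_chosen=-5.0, policy_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-6.0)
print(loss < math.log(2))  # True
```

When the policy and the reference model score both responses identically, the margin is zero and the loss equals log(2); training pushes the margin positive. This simple form can overflow for very large negative margins, which a production implementation would guard against.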

C-MHChem dataset

C-MHChem is a high-quality, entirely hand-written multiple-choice benchmark of 600 questions, collected from junior high school, high school, and college entrance examinations held across China over the past 25 years.

ChemLLM-Dataset.torrent

Seeding: 1, Downloading: 0, Completed: 227, Total Downloads: 875

ChemLLM-Dataset/
- README.md (2.09 KB)
- README.txt (4.18 KB)

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at support@hyper.ai for prompt review and removal.
