ChemData Chemical Task Dataset
Dataset Introduction
This dataset was open-sourced in 2024 by the Shanghai Artificial Intelligence Laboratory together with ChemLLM (Pu Ke Chemistry), its first large language model for science. The accompanying paper is "ChemLLM: A Chemical Large Language Model".
The main component of the release is ChemData700K. The research team also open-sourced the Chinese and English versions of ChemBench-4K, ChemPref-10K, and the C-MHChem dataset.
ChemData700K dataset
ChemData700K is an instruction fine-tuning dataset for building chemistry capabilities in large language models. It covers 9 core chemistry tasks and contains 730K high-quality question-answer pairs, sampled at a rate of roughly one in ten from a pool of 7 million entries. The dataset spans a wide range of chemistry knowledge and is divided into 3 main task categories: molecules, reactions, and domains.
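As a rough illustration of the one-in-ten sampling mentioned above, the sketch below draws a uniform random subset from a pool of records. The function name `sample_fraction`, the fixed seed, and the toy pool are assumptions made for illustration; the source does not describe how the team actually selected its sample.

```python
import random

def sample_fraction(records, fraction=0.1, seed=0):
    """Return a uniform random subset containing `fraction` of `records`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = round(len(records) * fraction)
    rng = random.Random(seed)  # local RNG so the sample is reproducible
    return rng.sample(records, k)

# Toy stand-in for the 7 million source entries (hypothetical data).
pool = [f"qa-{i}" for i in range(7000)]
subset = sample_fraction(pool)
print(len(subset))  # 700
```

Sampling without replacement guarantees that no entry appears twice in the subset; a stratified sample per task would be a reasonable alternative if the task sizes are uneven.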
ChemBench4K benchmark dataset
ChemBench is a benchmark consisting of 9 tasks on chemical molecules and reactions, the same 9 tasks as in ChemData. It provides a basis for objectively measuring the chemistry proficiency of LLMs. ChemBench contains 4,100 multiple-choice questions, each with exactly one correct answer.
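Because every question has a single correct option, a benchmark of this kind can be scored with plain accuracy. The following is a minimal sketch under assumptions: the option letters, the `accuracy` helper, and the sample answers are hypothetical, and the source does not specify the option format or an official scoring script.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical answer key and model outputs for four questions.
answer_key = ["A", "C", "B", "D"]
model_choices = ["A", "c", "D", "D"]
score = accuracy(model_choices, answer_key)
print(f"{score:.2f}")  # 0.75
```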
ChemPref-10K dataset
This dataset can be used to optimize language models to match human preferences and contains both English and Chinese versions.
C-MHChem dataset
C-MHChem is a high-quality, entirely human-written multiple-choice benchmark of 600 questions collected from junior high school, high school, and college entrance examinations across China over the past 25 years.