MassSpecGym: A benchmark for the discovery and identification of molecules

Abstract
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even for human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, which limits our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and making the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
Benchmarks
| Benchmark | Methodology | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|---|
| de-novo-molecule-generation-from-ms-ms | Random chemical generation | 0.00 | 28.59 | 0.07 | 0.00 | 25.72 | 0.10 |
| de-novo-molecule-generation-from-ms-ms | SELFIES Transformer | 0.00 | 33.28 | 0.10 | 0.00 | 21.84 | 0.15 |
| de-novo-molecule-generation-from-ms-ms | SMILES Transformer | 0.00 | 53.80 | 0.07 | 0.00 | 21.97 | 0.17 |
| de-novo-molecule-generation-from-ms-ms-1 | SMILES Transformer | 0.00 | 79.39 | 0.03 | 0.00 | 52.13 | 0.10 |
| de-novo-molecule-generation-from-ms-ms-1 | Random chemical generation | 0.00 | 21.11 | 0.08 | 0.00 | 18.25 | 0.11 |
| de-novo-molecule-generation-from-ms-ms-1 | SELFIES Transformer | 0.00 | 38.88 | 0.08 | 0.00 | 26.87 | 0.13 |

| Benchmark | Methodology | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|---|
| molecule-retrieval-from-ms-ms-spectrum-bonus | DeepSets | 4.42 | 14.46 | 30.76 | 15.04 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | MIST | 9.57 | 22.11 | 41.12 | 12.75 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | Random | 3.06 | 11.35 | 27.74 | 13.87 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | DeepSets + Fourier features | 6.56 | 16.46 | 33.46 | 14.14 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | Fingerprint FFN | 5.09 | 14.69 | 31.97 | 14.94 |
| molecule-retrieval-from-ms-ms-spectrum-on | DeepSets + Fourier features | 5.24 | 12.58 | 28.21 | 22.13 |
| molecule-retrieval-from-ms-ms-spectrum-on | Fingerprint FFN | 2.54 | 7.59 | 20.00 | 24.66 |
| molecule-retrieval-from-ms-ms-spectrum-on | MIST | 14.64 | 34.87 | 59.15 | 15.37 |
| molecule-retrieval-from-ms-ms-spectrum-on | DeepSets | 1.47 | 6.21 | 19.23 | 25.11 |
| molecule-retrieval-from-ms-ms-spectrum-on | Random | 0.37 | 2.01 | 8.22 | 30.81 |

| Benchmark | Methodology | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|---|
| ms-ms-spectrum-simulation-bonus-chemical | Precursor m/z | 2.09 | 8.52 | 22.65 |
| ms-ms-spectrum-simulation-bonus-chemical | FFN Fingerprint | 7.62 | 22.70 | 44.12 |
| ms-ms-spectrum-simulation-bonus-chemical | FraGNNet | 31.93 | 63.20 | 82.70 |
| ms-ms-spectrum-simulation-bonus-chemical | GNN | 3.63 | 13.55 | 33.77 |

| Benchmark | Methodology | Cosine Similarity | Jensen-Shannon Similarity | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|---|---|---|
| ms-ms-spectrum-simulation-on-massspecgym | GNN | 0.19 | 0.20 | 3.95 | 11.92 | 26.27 |
| ms-ms-spectrum-simulation-on-massspecgym | FFN Fingerprint | 0.25 | 0.24 | 8.44 | 21.43 | 38.57 |
| ms-ms-spectrum-simulation-on-massspecgym | Precursor m/z | 0.15 | 0.15 | 0.38 | 1.72 | 7.17 |
| ms-ms-spectrum-simulation-on-massspecgym | FraGNNet | 0.52 | 0.47 | 46.64 | 72.56 | 83.58 |
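
To make two of the reported metrics concrete, the following is a minimal Python sketch of a hit rate @ k and a Tanimoto similarity. It is not the MassSpecGym implementation: the function names, the candidate IDs, and the representation of a fingerprint as a set of on-bit indices are assumptions made for illustration. It also assumes that hit rates are expressed as percentages, which the values above suggest but the page does not state.

```python
from typing import Hashable, Sequence, Set


def hit_rate_at_k(ranked_candidates: Sequence[Sequence[Hashable]],
                  true_labels: Sequence[Hashable],
                  k: int) -> float:
    """Percentage of queries whose true label is among the top-k candidates."""
    if len(ranked_candidates) != len(true_labels):
        raise ValueError("one ranked candidate list per query is required")
    if not true_labels:
        return 0.0
    hits = sum(
        1
        for ranking, truth in zip(ranked_candidates, true_labels)
        if truth in ranking[:k]
    )
    return 100.0 * hits / len(true_labels)


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as on-bit index sets."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Toy data (hypothetical IDs): three query spectra, each with a ranked candidate list.
rankings = [["m1", "m7", "m3"], ["m4", "m2", "m9"], ["m5", "m6", "m8"]]
truths = ["m1", "m9", "m0"]

print(hit_rate_at_k(rankings, truths, k=1))  # one of three queries is hit at rank 1
print(hit_rate_at_k(rankings, truths, k=3))  # two of three queries are hit within rank 3
print(tanimoto({1, 2, 3}, {2, 3, 4}))        # 2 shared bits out of 4 distinct bits
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets, and the benchmark's own evaluation code should be used when comparing against the numbers in the tables.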