Xing Han Lù

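Abstract
We introduce BM25S, an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. By eagerly computing BM25 scores during indexing and storing them in sparse matrices, BM25S achieves up to a 500x speedup over the most popular Python-based framework. It also achieves considerable speedups over highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code is available at https://github.com/xhluca/bm25s.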
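The eager-scoring and score-shifting ideas mentioned in the abstract can be illustrated with a minimal sketch. It is not the bm25s API: the function names `build_index` and `retrieve` and the `delta` parameter are hypothetical, the scoring formula is a Lucene-style BM25 chosen for illustration, and plain Python dictionaries stand in for the SciPy sparse matrices so that the example runs without dependencies. With `delta=0` a token contributes nothing to documents that do not contain it, so the stored scores are naturally sparse. With `delta>0` every document would receive a nonzero score, so the sketch stores each score minus the token's zero-frequency baseline and adds that baseline back at query time.

```python
import math
from collections import Counter, defaultdict


def build_index(corpus_tokens, k1=1.5, b=0.75, delta=0.0):
    """Precompute a score for every (token, document) pair that occurs.

    Returns (scores, shift):
      scores[token][doc_id] = full score minus the token's tf=0 baseline
      shift[token]          = the tf=0 baseline (zero when delta == 0)
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))

    # Lucene-style IDF (an assumption made for this sketch).
    idf = {
        tok: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for tok, df in doc_freq.items()
    }
    # Score a token would receive in a document where it never appears.
    shift = {tok: idf[tok] * delta for tok in idf}

    scores = defaultdict(dict)  # sparse: only occurring pairs are stored
    for doc_id, doc in enumerate(corpus_tokens):
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        for tok, tf in Counter(doc).items():
            full = idf[tok] * (tf / (tf + norm) + delta)
            scores[tok][doc_id] = full - shift[tok]
    return scores, shift


def retrieve(query_tokens, scores, shift, n_docs, k=3):
    """Sum the precomputed entries for the query tokens and return the top k."""
    totals = [0.0] * n_docs
    baseline = 0.0
    for tok in query_tokens:
        baseline += shift.get(tok, 0.0)
        for doc_id, value in scores.get(tok, {}).items():
            totals[doc_id] += value
    ranked = sorted(range(n_docs), key=lambda i: totals[i], reverse=True)[:k]
    # Adding the same baseline to every document restores the full scores
    # without changing the ranking.
    return [(doc_id, totals[doc_id] + baseline) for doc_id in ranked]


corpus = [
    ["a", "cat", "purrs"],
    ["a", "dog", "barks", "loudly"],
    ["a", "bird", "sings"],
]
scores, shift = build_index(corpus)
print(retrieve(["cat"], scores, shift, n_docs=len(corpus), k=2))
```

Because all scoring work happens in `build_index`, a query reduces to looking up and summing stored values, which is the property the abstract credits for the speedup.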
Code Repositories
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| retrieval-on-hotpotqa | BM25S | Queries per second: 20.88  | 
| retrieval-on-hotpotqa | Elasticsearch | Queries per second: 7.11  | 
| retrieval-on-hotpotqa | Rank-BM25 | Queries per second: 0.04  | 
| retrieval-on-natural-questions | Elasticsearch | Queries per second: 12.16  | 
| retrieval-on-natural-questions | Rank-BM25 | Queries per second: 0.10  | 
| retrieval-on-natural-questions | BM25S | Queries per second: 41.85  | 
| retrieval-on-quora-question-pairs | Elasticsearch | Queries per second: 21.8  | 
| retrieval-on-quora-question-pairs | BM25-PT | Queries per second: 6.49  | 
| retrieval-on-quora-question-pairs | Rank-BM25 | Queries per second: 1.18  | 
| retrieval-on-quora-question-pairs | BM25S | Queries per second: 183.53  | 
| text-retrieval-on-climate-fever | Lucene (BM25S) | nDCG@10: 16.2  | 
| text-retrieval-on-dbpedia | Lucene (BM25S) | nDCG@10: 31.9  | 
| text-retrieval-on-fever | Lucene (BM25S) | nDCG@10: 63.8  | 
| text-retrieval-on-hotpotqa | Lucene (BM25S) | nDCG@10: 62.9  | 
| text-retrieval-on-ms-marco | Lucene (BM25S) | nDCG@10: 22.8  | 
| text-retrieval-on-natural-questions | Lucene (BM25S) | nDCG@10: 30.5  | 
| text-retrieval-on-nfcorpus | Lucene (BM25S) | nDCG@10: 31.8  | 
| text-retrieval-on-quora-question-pairs | Lucene (BM25S) | nDCG@10: 78.7  | 
| text-retrieval-on-scidocs | Lucene (BM25S) | nDCG@10: 67.6  | 
| text-retrieval-on-scifact | Lucene (BM25S) | nDCG@10: 15  | 
| text-retrieval-on-trec-covid | Lucene (BM25S) | nDCG@10: 58.9  |