Introducing ScriptureBench and ScriptureLM-1

IndexerLabs is pleased to announce the launch of two new projects.

The first is ScriptureBench, the first comprehensive evaluation set designed to measure the recall and precision of AI-based scripture indexing systems against human-produced scripture indices. ScriptureBench is a curated dataset of public-domain texts and constructed synthetic examples. The synthetic examples stress-test model behavior under controlled conditions and ensure that models have had no prior exposure to the ground-truth indices.

This design enables empirical evaluation of both deterministic and non-deterministic extraction systems across a wide range of real-world and edge-case scenarios. Our goal is for ScriptureBench to serve as an industry-standard benchmark for assessing scripture indexing performance across leading frontier large language models.

ScriptureBench

Model                    Accuracy
ScriptureLM-1            99.1%
Gemini 3 Pro Thinking    77.1%
GPT-5.2-Pro-XHigh        73.3%
Claude 4.5 Opus          41.2%
DeepSeek-V3.2            39.9%

Accuracy is measured as the percentage of scripture references in the ground-truth human index that the system correctly identifies.
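To make the metric concrete, here is a minimal sketch of how a predicted set of references could be scored against a human-produced index. It is an illustration under stated assumptions, not the ScriptureBench scoring code: the function name `score_index`, the exact-match comparison on reference strings, and the sample references are all hypothetical. On this reading, the reported accuracy corresponds most closely to recall (the share of ground-truth references the system found), with precision shown alongside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float  # share of predicted references that are in the human index
    recall: float     # share of human-index references that the system found


def score_index(predicted: set[str], ground_truth: set[str]) -> Score:
    """Score predicted scripture references against a human-produced index.

    Assumption: references are already normalized to a common string form
    (e.g. "John 3:16") and are compared by exact match.
    """
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return Score(precision=precision, recall=recall)


# Hypothetical example data, not taken from ScriptureBench.
human_index = {"Gen 1:1", "John 3:16", "Rom 8:28", "Ps 23:1"}
model_output = {"Gen 1:1", "John 3:16", "Rom 8:29"}

result = score_index(model_output, human_index)
print(f"precision={result.precision:.1%} recall={result.recall:.1%}")
```

In this toy case the system finds two of the four indexed references and adds one spurious reference, so recall and precision differ. A real scorer would also need a policy for ranges and partial matches (for example, "Rom 8:28-30" versus "Rom 8:28"), which this sketch does not attempt.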

We are also launching our first domain-fine-tuned scripture indexing model, ScriptureLM-1. ScriptureLM-1 is a variant of the European open-source Mistral 3 Large model, fine-tuned on a comprehensive suite of scripture indexing tasks and supporting tooling.

As of January 2026, ScriptureLM-1 achieves state-of-the-art results on ScriptureBench at 99.1% accuracy, significantly outperforming Gemini 3 Pro Thinking (77.1%), GPT-5.2-Pro-XHigh (73.3%), Claude 4.5 Opus (41.2%), and DeepSeek-V3.2 (39.9%). Samples from our benchmarks can be found on the Scripture Indexing Demo page.

Beyond improving domain-specific accuracy and reducing inference latency, our fine-tuning approach enables IndexerLabs to operate the full scripture indexing pipeline entirely on our own infrastructure. By running ScriptureLM-1 on self-hosted servers, we retain full control over data handling while avoiding third-party data exposure. This approach allows us to adhere to the highest standards of data protection, as outlined on our Data Protection page.

IndexerLabs believes that leading frontier models are highly capable and will power much of our future technology. In their raw, out-of-the-box state, however, whether accessed via web interfaces or general-purpose APIs, they often fall short of the precision and domain-specific reliability required for specialized tasks such as scripture indexing.

We believe this limitation also extends to subject indexing, and that meaningful progress in this domain will require verifiable benchmarks, domain-specific fine-tuning, and tooling designed explicitly for structured subject indexing at scale.