One of the hardest parts of evaluating AI-generated subject indexes is that there is no single obvious benchmark. Even two skilled human indexers can produce different indexes for the same book while both remaining professionally defensible: they may choose different phrasings, emphasize different themes, or split and merge topics differently.
That makes subject indexing harder to evaluate than tasks with a single correct output. Still, comparison is possible, and we think it should be done as rigorously as possible.
For one recent backtest, we ran our subject-indexing pipeline on The Oxford History of the French Revolution, a 425-page scholarly book with an existing professional printed index. Our goal was not to ask whether our system reproduced the printed index word for word, but whether it converged on a substantial portion of the same subjects, at a comparable length, with accurate locators. In other words, we wanted to see whether our AI indexing system produced an index equivalent to a professional indexer's.
[Figure: Full index comparison, showing the original professional index alongside the IndexerLabs generated index.]
We compared the generated index and the printed human index in several ways.
First, we controlled for overall top-level entry count and overall index length, since a much longer index can gain an artificial advantage simply by including more access points.
Second, we compared entry overlap. Some matches were exact string matches; others required manual semantic adjudication, since differently worded entries can still clearly refer to the same person or topic, such as "William V, Stadtholder of Orange" (professional index) and "William V (Prince of Orange)" (our index).
Third, within matched entries, we compared locator overlap to see how often the generated index pointed to the same pages as the printed index.
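To make these three checks concrete, here is a minimal sketch of the comparison logic. The data model, helper names, and the exact overlap metric (Jaccard over page numbers) are illustrative assumptions, not a description of our production pipeline; anything the cheap normalization cannot match is queued for a human rather than counted as a miss.

```python
from typing import Dict, Set

# Illustrative data model (an assumption, not our internal format):
# each index maps a top-level heading to its set of page locators.
Index = Dict[str, Set[int]]

def normalize(heading: str) -> str:
    """Cheap normalization before exact matching. Pairs like
    'William V, Stadtholder of Orange' vs 'William V (Prince of Orange)'
    do not collapse under this and still need manual adjudication."""
    for ch in ",()":
        heading = heading.replace(ch, " ")
    return " ".join(heading.lower().split())

def compare(generated: Index, printed: Index) -> dict:
    gen = {normalize(h): locs for h, locs in generated.items()}
    pro = {normalize(h): locs for h, locs in printed.items()}

    # Check 1: entry counts, so a longer index cannot win
    # simply by offering more access points.
    entry_counts = (len(gen), len(pro))

    # Check 2: entry overlap after normalization; the symmetric
    # difference goes to manual semantic adjudication.
    matched = gen.keys() & pro.keys()
    needs_review = gen.keys() ^ pro.keys()

    # Check 3: locator overlap within matched entries (Jaccard over
    # page numbers, one reasonable reading of "same pages").
    shared = sum(len(gen[h] & pro[h]) for h in matched)
    union = sum(len(gen[h] | pro[h]) for h in matched)

    return {
        "entry_counts": entry_counts,
        "auto_matched": len(matched),
        "needs_review": len(needs_review),
        "locator_overlap": shared / union if union else 0.0,
    }
```

In practice, the manual adjudication step then promotes semantically equivalent pairs from the review queue into the matched set before the locator comparison is run.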
In this backtest, our generated index produced 1,058 top-level entries, compared with 1,066 in the printed human index. Of those 1,058 entries, 608 matched entries in the human index, counting both exact matches and manually adjudicated semantic equivalents. Within those matched entries, 98.36% of locators overlapped exactly, and the generated index came out to nearly the same overall page length as the printed index.
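In relative terms, simple arithmetic on those counts (nothing here beyond the numbers already reported):

```python
matched, generated_total, printed_total = 608, 1_058, 1_066
print(f"{matched / generated_total:.1%} of generated entries matched")  # 57.5%
print(f"{matched / printed_total:.1%} of printed entries matched")      # 57.0%
```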
Comparing only top-level entries for a moment, we can highlight the overlapping entries between the two indexes in green, as shown below:
[Figure: Marked overlap comparison, showing the original index and the IndexerLabs index with shared top-level entries highlighted in green.]
We believe this strongly suggests substantial agreement between the generated and human indexes: the observed overlap appears far greater than would be expected by chance.
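To give a rough sense of what "expected by chance" could mean here, consider a toy baseline in which both indexes draw their headings uniformly at random from the book's pool of plausible candidate headings. Every parameter below, including the 20,000-heading pool size, is an illustrative assumption rather than a measured value; the closed-form expectation is simply a * b / pool_size.

```python
import random

def chance_overlap(pool_size: int, a: int, b: int, trials: int = 200) -> float:
    """Monte Carlo estimate of the shared-entry count for two indexes of
    sizes a and b drawn uniformly at random from pool_size candidate
    headings (closed form: a * b / pool_size)."""
    pool = range(pool_size)
    total = 0
    for _ in range(trials):
        first = set(random.sample(pool, a))
        second = set(random.sample(pool, b))
        total += len(first & second)
    return total / trials

# Illustrative assumption: ~20,000 plausible candidate headings in the book.
print(chance_overlap(20_000, 1_058, 1_066))  # ~56 shared entries vs. 608 observed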
We discuss this substantial overlap in greater detail in our blog post.