Oxford History of the French Revolution
One of the hardest parts of evaluating AI-generated subject indexes is that there is no single obvious benchmark. Even two skilled human indexers can produce different indexes for the same book while both remaining professionally defensible. They may phrase headings differently, emphasize different themes, split or merge topics differently, or select different entries altogether.
That makes subject indexing harder to evaluate by direct comparison than tasks with a single correct output. Still, comparison is possible, and we think it should be done as rigorously as possible.
For one recent backtest, we ran our subject-indexing pipeline on The Oxford History of the French Revolution, a 425-page scholarly book with an existing professional printed index. Our goal was not to ask whether our system reproduced the printed index word for word, but whether it converged on a substantial portion of the same subjects, at a comparable length, with accurate locators. In other words, we wanted to see whether our AI indexing system produced an index equivalent to one made by a professional.
Full Index Comparison
Original Professional Index
IndexerLabs Generated Index
We compared the generated index and the printed human index in several ways.
First, we compared and controlled for overall top-level entry count and overall index length, since a much longer index can gain an artificial advantage simply by including more access points.
Second, we compared entry overlap. Some matches were exact string matches, while others required manual semantic adjudication. For example, slightly differently worded entries may still clearly refer to the same person or topic, such as "William V, Stadtholder of Orange" (professional index) and "William V (Prince of Orange)" (our index).
Third, within matched entries, we compared locator overlap to see how often the generated index pointed to the same pages as the printed index.
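The three comparisons above can be sketched in a few lines of Python. This is a minimal illustration, not our production tooling: the function names, the normalization rule, the definition of locator overlap (share of the printed index's locators in matched entries that the generated index also lists), and the toy page numbers are all assumptions made for the example.

```python
import re

def normalize(heading: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (an assumed rule)."""
    cleaned = re.sub(r"[^\w\s]", " ", heading.lower())
    return " ".join(cleaned.split())

def compare_indexes(generated, printed, adjudicated=None):
    """Compare two indexes given as {heading: set of page numbers}.

    `adjudicated` maps a generated heading to the printed heading a human
    reviewer judged equivalent; it stands in for manual semantic adjudication.
    """
    adjudicated = adjudicated or {}
    printed_by_norm = {normalize(h): h for h in printed}
    matches = {}
    for heading in generated:
        if heading in adjudicated:
            matches[heading] = adjudicated[heading]
        elif normalize(heading) in printed_by_norm:
            matches[heading] = printed_by_norm[normalize(heading)]
    # Locator overlap is measured only within matched entries.
    shared = sum(len(generated[g] & printed[p]) for g, p in matches.items())
    total = sum(len(printed[p]) for p in matches.values())
    return {
        "generated_entries": len(generated),
        "printed_entries": len(printed),
        "matched_entries": len(matches),
        "locator_overlap": shared / total if total else 0.0,
    }

# Toy data: headings are illustrative and page numbers are invented.
generated = {
    "William V (Prince of Orange)": {45, 160},
    "Bastille": {110, 111},
    "assignats": {132},
}
printed = {
    "William V, Stadtholder of Orange": {45, 160},
    "Bastille": {110, 112},
    "Varennes": {152},
}
adjudicated = {"William V (Prince of Orange)": "William V, Stadtholder of Orange"}

result = compare_indexes(generated, printed, adjudicated)
print(result)
```

Keeping the adjudicated pairs in an explicit mapping makes the manual judgments auditable separately from the automatic string matches.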
In this backtest, our generated index produced 1,058 top-level entries, compared with 1,066 in the printed human index. Of those 1,058 entries, 608 matched entries in the human index, counting both exact matches and manually adjudicated semantic equivalents. Within those matched entries, 98.36% of locators overlapped exactly, and the generated index came out to nearly the same overall page length as the printed index.
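As a quick check on the figures above, the match rates can be derived directly from the reported counts. The variable names are ours, and the rate against the printed index assumes each match pairs exactly one generated entry with one printed entry.

```python
# Figures reported in the text above.
generated_entries = 1058
printed_entries = 1066
matched_entries = 608

# Share of generated entries that found a counterpart in the printed index.
generated_match_rate = matched_entries / generated_entries
# Same count measured against the printed index (assumes one-to-one matching).
printed_match_rate = matched_entries / printed_entries
# How close the two indexes are in top-level entry count.
count_ratio = generated_entries / printed_entries

print(f"{generated_match_rate:.1%} of generated entries matched")
print(f"{printed_match_rate:.1%} of printed entries matched (if one-to-one)")
print(f"entry-count ratio: {count_ratio:.3f}")
```

On these counts, roughly 57.5% of generated entries have a counterpart in the printed index, and the two indexes differ in top-level entry count by less than 1%.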
Looking only at top-level entries, the entries that appear in both indexes are highlighted in green below:
Marked Overlap Comparison
Original Index (Marked Overlap)
IndexerLabs Index (Marked Overlap)
We believe this shows substantial agreement between the generated and human indexes; the observed overlap appears far greater than would be expected by chance.
We discuss this substantial overlap in greater detail in our blog post.
Neomania
IndexerLabs Generated Index
Original Human Index
Notes & Analysis
Notes
- The book used for this test is Neomania, by Dr. Krist Vaesen, published by Open Book Publishers.
- No human intervention or editing occurred when generating the index, aside from setting the target page range for generation.
- The book's original index was removed before generation to ensure our system could not be influenced by (or inadvertently reproduce) the existing index.
- The original Neomania index contains no subentries, so we disabled subentry generation to enable a fair, direct comparison. If you would like to see an example with subentries, see our other demo for this book with subentries.
Analysis
While indexing is inherently subjective, two professional indexes of the same manuscript should still show substantial overlap because the underlying text is the same. That makes it reasonable to compare our AI-generated index against a professionally produced human index.
A few observations:
- There is significant overlap in entries and locators.
Many entries appear in both indexes with identical phrasing and the same page references, for example: "arXiv," "Bacon, Francis," "bullshit jobs," "academic freedom," and "Daston, Lorraine." Overall, the two indexes have approximately 55% exact-match overlap (same heading phrasing and matching locators).
- Some differences reflect indexing judgment and specificity.
One notable example is "coordination." The human index groups the broader theme under a single entry ("coordination"), whereas our system takes a narrower approach and indexes only the more specific concept ("Peirce-style coordination").
Looking closely at the human index locators, page 123 appears under both "coordination" and "Peirce-style coordination." But the passage on that page is explicitly about Peirce-style coordination:
... In this future, science funders incentivize researchers to engage in Peirce-style coordination, and to define and commit to a number of well-selected research programmes. Journals and universities do the same: they too change their incentive structure such that researchers are stimulated to take up research programme work...
Because the "coordination" mentioned here is clearly the Peirce-style variant, listing page 123 under both headings may be redundant. More broadly, the instances of "coordination" that the index points to are overwhelmingly Peirce-style coordination, so the separate "coordination" heading adds little navigational value. In practice, a large share of its locators overlap with those of "Peirce-style coordination," making the broader entry function as a near-duplicate rather than a distinct set of occurrences. Some of the remaining locators are also arguably low-value (for example, a reference to the acknowledgements), which further weakens the case for keeping the broader heading as a standalone entry.
In other words, this may be less an error than a tradeoff between space and specificity (or a difference in indexing style). Much of the remaining 45% difference comes down to phrasing, such as "Artificial Intelligence (AI)" versus "AI (Artificial Intelligence)," along with various judgment calls.
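The redundancy argument about "coordination" can be made concrete by measuring how much of a broader heading's locator set also appears under a narrower heading. The sketch below is illustrative only: the `containment` helper is our own name, and apart from page 123, which the text cites, the page numbers are invented placeholders rather than the actual locators in either index.

```python
def containment(broad_locators, narrow_locators):
    """Fraction of the broad heading's locators also listed under the narrow one."""
    broad, narrow = set(broad_locators), set(narrow_locators)
    if not broad:
        return 0.0
    return len(broad & narrow) / len(broad)

# Placeholder locators: only page 123 comes from the text, the rest are made up.
coordination = {5, 88, 123, 130}
peirce_style = {88, 123, 130, 141}

share = containment(coordination, peirce_style)
print(f"{share:.0%} of 'coordination' locators also appear under the narrower heading")
```

A value close to 1 would suggest that the broader heading mostly repeats the narrower one, while a low value would suggest that it covers distinct passages and earns its place.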
Indexing is not purely mechanical, and neither is our system. It has a pronounced subjective streak, which we suspect is attributable in part to its extensive fine-tuning on hundreds of indexes. Our system prioritizes themes, chooses emphases, and commits to particular interpretations of what the reader will look for. That judgment can produce different tradeoffs than a conservative "include everything" approach, but it also avoids the generic, flattening style that many automated indexes tend to produce.