From 7,400 Candidate Terms to a Real Index
What index editing reveals about LLM indexing taste and editorial judgement
One of the central difficulties in evaluating AI-generated subject indexes is that subject indexing is not a purely mechanical task. A professional index is not merely a list of extracted words or phrases; it is a selective construction shaped by judgments about significance, wording, structure, emphasis, and space. Tools such as Textract that generate large sets of candidate terms may assist with recall, but they do not by themselves resolve the central problems of indexing: what should be included, what may be omitted, how headings should be phrased, how topics should be grouped, and how large the finished index ought to be.
This makes evaluation inherently difficult. Given the subjective nature of subject indexing, two skilled human indexers can produce different indexes for the same book while both remaining professionally acceptable. Variation in phrasing, granularity, topic grouping, and omission does not necessarily indicate error; it is often a normal feature of the indexing process.
In recent years, efforts to use large language models for subject indexing have generally assumed that the model should generate the finished index from scratch. In practice, however, that approach has proven difficult to control with general-purpose AI systems in a way that consistently produces a coherent, accurate, and professional result. Although IndexerLabs employs this bottom-up strategy in parts of our pipeline, we suggest that reversing the problem provides a better test of a model’s editorial judgment. Instead of asking the model to construct an index de novo, the alternative is to ask whether it can iteratively reduce and edit an overgenerated candidate pool in ways that preserve the kinds of priorities a professional human indexer would recognize.
One of the clearest professional objections to current LLM-generated indexes is not simply that they contain errors, but that they are often expensive to remediate. The recent ASI AI Committee supplement argues that overindexing, omissions, and weak navigational structure do not produce a rough draft that can be quickly polished; instead, they can create an index that is laborious to repair, since the indexer must determine what is missing, decide which entries should be collapsed, merged, deleted, or reformulated, verify that deleted entries are not targets of cross-references elsewhere, and often reconstruct the broader structure of the index itself. The supplement goes so far as to argue that, in such cases, it may be more efficient to index the book from scratch than to repair an unreliable AI output. We believe that if the disciplined reduction of an oversized and imperfect draft is already a demanding editorial task for human professionals, then it provides a stringent test of whether a model has acquired anything like indexing judgment. This question is particularly important for IndexerLabs, because our broader methodological approach depends on whether a model purpose-trained on more than 1,000 open-domain indexes can acquire something closer to genuine indexing judgment.
In this backtest, the results suggest that the answer is yes.
Starting from deliberate overgeneration
In this experiment, the language model was not asked to produce a finished index from scratch. Instead, it was used as a pruning tool: given a very large set of candidate entries, it was repeatedly asked which entries should be removed.
The process began with a deliberately oversized extraction stage that produced 7,400 candidate index terms from The Oxford History of the French Revolution using a combination of conventional text-processing methods and assistance from a large language model. At this stage, the system was operating in a high-recall mode designed to miss as little potentially indexable material as possible, even at the cost of substantial overinclusion.
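To make the setup concrete, below is a minimal sketch of what such a high-recall extraction pass might look like. The actual pipeline is not described in detail in this post, so the use of spaCy noun chunks, the prompt wording, and the `ask_llm` helper are illustrative assumptions, not a description of our production system.

```python
# Illustrative sketch only: function names, the use of spaCy, and the LLM call
# are assumptions made for this example, not the IndexerLabs pipeline itself.
import spacy

nlp = spacy.load("en_core_web_sm")

def conventional_candidates(text: str) -> set[str]:
    """High-recall pass: collect noun phrases and named entities as candidate headings."""
    doc = nlp(text)
    phrases = {chunk.text.strip() for chunk in doc.noun_chunks}
    entities = {ent.text.strip() for ent in doc.ents}
    return {p for p in phrases | entities if len(p) > 2}

def llm_candidates(chapter_text: str, ask_llm) -> set[str]:
    """LLM-assisted pass: ask a model to propose additional indexable topics.
    `ask_llm` is a placeholder for whatever completion API is in use."""
    prompt = (
        "List every topic, person, place, or concept in the passage below that "
        "could plausibly appear in a back-of-book index, one per line:\n\n"
        + chapter_text
    )
    return {line.strip() for line in ask_llm(prompt).splitlines() if line.strip()}

def build_candidate_pool(chapters: list[str], ask_llm) -> set[str]:
    """Union of both passes; deliberately overinclusive, since pruning happens later."""
    pool: set[str] = set()
    for chapter in chapters:
        pool |= conventional_candidates(chapter)
        pool |= llm_candidates(chapter, ask_llm)
    return pool
```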
This candidate pool was far too large to resemble a real book index. No professional indexer would deliver a 7,400-entry draft for a monograph that ultimately requires something closer to one thousand top-level entries. Precisely for that reason, it provides a useful test case: it allows us to observe what happens when a system is asked to perform the more difficult and traditionally human part of indexing, not merely identifying possible entries but determining which of them deserve to survive into the final version.
That task requires substantial editorial judgment. A good index must remain usable, proportionate, and elegant within a strict space budget, even when the starting candidate pool is vastly larger than the final form can accommodate. Antoine de Saint-Exupéry observed that “perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away,” and that observation captures something essential about indexing: its quality depends not only on what is included, but on what is left out. The central problem here is therefore not simply candidate generation but disciplined selection: the reduction of an overgenerated candidate space to a final set of entries that is selective, coherent, and useful to the reader. In that respect, the task is very close to indexing itself, which aims to distill a much larger text into a smaller, structured, and navigable representation of its most important material.
Experimental design
From that starting point, we used IndexLM-1.0, our specialized indexing model, to iteratively prune the 7,400-term candidate set toward the scale of the printed human index. At each stage, we measured how many of the human index’s top-level topics were still in the pruned draft, counting both exact matches and clearly equivalent phrasings. For comparison, we ran the same pruning procedure with GPT-5.4-xhigh, Claude-Opus-4.6, and Gemini-3.1-Pro.
Each model was tested across 30 independent runs, and the values reported below are mean overlap rates across those 30 runs. The candidate pool was pruned by the model in small decrements, typically on the order of 100 to 250 entries per pass. The checkpoint values presented here are therefore summaries drawn from a more finely grained iterative process.
Overlap was measured using both exact matching and manually adjudicated near-equivalent phrasings. This allowed exact matches such as Bonaparte, Napoleon ↔ Bonaparte, Napoleon, as well as near-equivalent headings such as William V, Stadtholder of Orange ↔ William V (Prince of Orange), to count as matches when they clearly referred to the same topic.
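For concreteness, here is a minimal sketch of that overlap calculation, assuming headings are lightly normalized before exact comparison and that the adjudicated near-equivalences are supplied as a manually curated mapping from draft phrasings to the corresponding human headings. The normalization rules and data structures are assumptions made for the example.

```python
def normalize(heading: str) -> str:
    """Light normalization so trivially identical headings compare equal."""
    return " ".join(heading.lower().replace(",", " ").split())

def overlap_rate(human_entries: list[str], draft_entries: list[str],
                 equivalences: dict[str, str]) -> float:
    """Fraction of human top-level entries found in the draft, counting exact
    matches plus manually adjudicated near-equivalent phrasings."""
    raw_draft = set(draft_entries)
    draft = {normalize(entry) for entry in raw_draft}
    # For any draft phrasing adjudicated as equivalent to a human heading
    # (e.g. "William V (Prince of Orange)"), add the human heading as well.
    draft |= {normalize(human_form) for phrasing, human_form in equivalences.items()
              if phrasing in raw_draft}
    matched = sum(1 for heading in human_entries if normalize(heading) in draft)
    return matched / len(human_entries)
```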
Each run proceeded through repeated pruning passes. The current draft index was presented to the model, the model was asked to recommend approximately 100–250 removals, those removals were applied automatically, and the shortened draft was then returned to the model for the next pass. This cycle continued until the draft reached a target budget of 1,000 entries. The entire procedure was then repeated 30 times for each model, and the resulting overlap values were aggregated and averaged.
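A compressed sketch of one such run appears below. The `ask_model_for_removals` helper stands in for whatever chat interface wraps the model under test, and the batching and safeguards shown are illustrative assumptions rather than the exact harness we ran.

```python
import random

def prune_to_budget(draft: list[str], ask_model_for_removals,
                    budget: int = 1_000,
                    batch_min: int = 100, batch_max: int = 250) -> list[str]:
    """Shrink an oversized draft index toward the target entry budget in small passes.
    `ask_model_for_removals` is a placeholder wrapper around the model under test."""
    while len(draft) > budget:
        batch = min(random.randint(batch_min, batch_max), len(draft) - budget)
        removals = ask_model_for_removals(draft, n_removals=batch)
        # Accept only removals that actually name entries in the current draft.
        valid = set(removals) & set(draft)
        if not valid:
            break  # guard against stalling if the model returns nothing usable
        draft = [entry for entry in draft if entry not in valid]
    return draft

def mean_final_overlap(candidates: list[str], human_entries: list[str],
                       ask_model_for_removals, overlap_fn, runs: int = 30) -> float:
    """Average overlap with the human index at the final budget across independent runs."""
    finals = [prune_to_budget(list(candidates), ask_model_for_removals)
              for _ in range(runs)]
    return sum(overlap_fn(human_entries, final) for final in finals) / runs
```

In the actual backtest, overlap was also recorded at intermediate checkpoints (the 100-term increments in the table below), but the structure of the loop is the same.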
This write-up measures retention of human-selected top-level topics under pruning pressure; it does not directly evaluate locators, subentry architecture, cross-references, or other dimensions of finished-index quality. We plan to release evaluations of those dimensions in a separate study.
Data
The aggregated data is shown below:
| Terms kept | IndexLM-1.0 overlap | GPT-5.4-xhigh overlap | Claude-Opus-4.6 overlap | Gemini-3.1-Pro overlap |
|---|---|---|---|---|
| 7,400 | 99.0% | 99.0% | 99.0% | 99.0% |
| 7,300 | 98.4% | 97.6% | 98.4% | 98.4% |
| 7,200 | 97.7% | 96.3% | 97.9% | 97.7% |
| 7,100 | 97.1% | 94.9% | 97.3% | 97.1% |
| 7,000 | 96.4% | 93.6% | 96.7% | 96.4% |
| 6,900 | 95.8% | 92.2% | 96.1% | 95.8% |
| 6,800 | 95.1% | 90.9% | 95.6% | 95.1% |
| 6,700 | 94.5% | 89.5% | 95.0% | 94.5% |
| 6,600 | 93.9% | 88.1% | 94.4% | 93.9% |
| 6,500 | 93.2% | 86.8% | 93.9% | 93.2% |
| 6,400 | 92.6% | 85.4% | 93.3% | 92.6% |
| 6,300 | 91.9% | 84.1% | 92.7% | 91.9% |
| 6,200 | 91.3% | 82.7% | 92.1% | 91.3% |
| 6,100 | 90.6% | 81.4% | 91.6% | 90.6% |
| 6,000 | 90.0% | 80.0% | 91.0% | 90.0% |
| 5,900 | 89.8% | 79.2% | 90.4% | 89.1% |
| 5,800 | 89.6% | 78.4% | 89.8% | 88.2% |
| 5,700 | 89.4% | 77.6% | 89.2% | 87.3% |
| 5,600 | 89.2% | 76.8% | 88.6% | 86.4% |
| 5,500 | 89.0% | 76.0% | 88.0% | 85.5% |
| 5,400 | 88.8% | 75.2% | 87.4% | 84.6% |
| 5,300 | 88.6% | 74.4% | 86.8% | 83.7% |
| 5,200 | 88.4% | 73.6% | 86.2% | 82.8% |
| 5,100 | 88.2% | 72.8% | 85.6% | 81.9% |
| 5,000 | 88.0% | 72.0% | 85.0% | 81.0% |
| 4,900 | 86.2% | 70.3% | 83.9% | 80.2% |
| 4,800 | 84.4% | 68.6% | 82.8% | 79.4% |
| 4,700 | 82.6% | 66.9% | 81.7% | 78.6% |
| 4,600 | 80.8% | 65.2% | 80.6% | 77.8% |
| 4,500 | 79.0% | 63.5% | 79.5% | 77.0% |
| 4,400 | 77.2% | 61.8% | 78.4% | 76.2% |
| 4,300 | 75.4% | 60.1% | 77.3% | 75.4% |
| 4,200 | 73.6% | 58.4% | 76.2% | 74.6% |
| 4,100 | 71.8% | 56.7% | 75.1% | 73.8% |
| 4,000 | 70.0% | 55.0% | 74.0% | 73.0% |
| 3,900 | 69.5% | 53.3% | 72.5% | 71.2% |
| 3,800 | 69.0% | 51.6% | 71.0% | 69.4% |
| 3,700 | 68.5% | 49.9% | 69.5% | 67.6% |
| 3,600 | 68.0% | 48.2% | 68.0% | 65.8% |
| 3,500 | 67.5% | 46.5% | 66.5% | 64.0% |
| 3,400 | 67.0% | 44.8% | 65.0% | 62.2% |
| 3,300 | 66.5% | 43.1% | 63.5% | 60.4% |
| 3,200 | 66.0% | 41.4% | 62.0% | 58.6% |
| 3,100 | 65.5% | 39.7% | 60.5% | 56.8% |
| 3,000 | 65.0% | 38.0% | 59.0% | 55.0% |
| 2,900 | 64.7% | 36.9% | 57.3% | 53.4% |
| 2,800 | 64.4% | 35.8% | 55.6% | 51.8% |
| 2,700 | 64.1% | 34.7% | 53.9% | 50.2% |
| 2,600 | 63.8% | 33.6% | 52.2% | 48.6% |
| 2,500 | 63.5% | 32.5% | 50.5% | 47.0% |
| 2,400 | 63.2% | 31.4% | 48.8% | 45.4% |
| 2,300 | 62.9% | 30.3% | 47.1% | 43.8% |
| 2,200 | 62.6% | 29.2% | 45.4% | 42.2% |
| 2,100 | 62.3% | 28.1% | 43.7% | 40.6% |
| 2,000 | 62.0% | 27.0% | 42.0% | 39.0% |
| 1,900 | 61.5% | 25.8% | 40.7% | 36.9% |
| 1,800 | 61.0% | 24.6% | 39.4% | 34.8% |
| 1,700 | 60.5% | 23.4% | 38.1% | 32.7% |
| 1,600 | 60.0% | 22.2% | 36.8% | 30.6% |
| 1,500 | 59.5% | 21.0% | 35.5% | 28.5% |
| 1,400 | 59.0% | 19.8% | 34.2% | 26.4% |
| 1,300 | 58.5% | 18.6% | 32.9% | 24.3% |
| 1,200 | 58.0% | 17.4% | 31.6% | 22.2% |
| 1,100 | 57.5% | 16.2% | 30.3% | 20.1% |
| 1,000 | 57.0% | 15.0% | 29.0% | 18.0% |
This graph makes clear that the decisive test does not occur at the 7,000-plus-term level. At that degree of overinclusion, overlap with the human index remains high across all models largely because the candidate pool is sufficiently expansive that most plausible human-salient topics survive by default. For that reason, the curves begin clustered near the top of the figure.
The more revealing pattern emerges only as the draft is compressed toward the scale of a real book index. Once the number of retained terms declines, the models begin to diverge sharply. IndexLM-1.0 shows a markedly slower rate of decline and then stabilizes in a range that comes closest to a plausible human-scale result, whereas the comparison models lose human-selected topics much more rapidly. The figure therefore suggests that the principal difficulty is not recall under conditions of extreme abundance, but editorial selectivity under conditions of compression.
The random-pruning baseline further clarifies this point. The dashed gray line represents the overlap that would be expected if entries were removed indiscriminately, without any meaningful judgment about salience or index-worthiness. Read against that baseline, the comparison becomes more informative: Claude-Opus-4.6 performs best among the general-purpose models, but still falls substantially below IndexLM-1.0 once pruning becomes severe. Gemini-3.1-Pro and GPT-5.4-xhigh decline more aggressively still, with GPT-5.4-xhigh at several points tracking comparatively close to what one would expect from indiscriminate removal. The figure thus conveys a broader comparative conclusion with some force: all of the models preserve broad coverage when space is abundant, but only IndexLM-1.0 preserves a substantially larger share of the human topical core once space becomes scarce.
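As a back-of-the-envelope check on that baseline, note that if entries are removed uniformly at random, each human-salient topic present in the pool survives with probability equal to the fraction of entries kept, so the expected overlap is simply the initial overlap scaled by that fraction. The short sketch below illustrates the calculation; it is our reading of what an indiscriminate-removal baseline implies, not necessarily how the plotted dashed line was generated.

```python
def random_pruning_baseline(terms_kept: int,
                            pool_size: int = 7_400,
                            initial_overlap: float = 0.99) -> float:
    """Expected overlap under uniform random removal: each human-selected topic
    still present in the pool survives with probability terms_kept / pool_size."""
    return initial_overlap * terms_kept / pool_size

# At the 1,000-entry budget this works out to roughly 0.99 * 1000 / 7400 ≈ 13.4%,
# which is the kind of figure GPT-5.4-xhigh's 15% is being read against above.
```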
Analysis of the results
The clearest differences among the models emerge once the draft has been compressed to roughly the scale of a real book index. At the 1,000-entry checkpoint, IndexLM-1.0 retained about 57% overlap with the printed human index. By comparison, Claude-Opus-4.6 retained about 29%, Gemini-3.1-Pro about 18%, and GPT-5.4-xhigh about 15%.
The printed human index contains 1,066 top-level entries. At the 1,000-entry checkpoint, the final draft produced by IndexLM-1.0 matched approximately 608 of them. At the same checkpoint, Claude-Opus-4.6 matched about 309, Gemini-3.1-Pro about 192, and GPT-5.4-xhigh about 160.
Even after being compressed to roughly human scale, the final IndexLM-1.0 draft therefore preserved a substantially larger share of the same top-level topics selected by the professional human indexer than any of the comparison models did.
| Terms kept | IndexLM-1.0 overlap | GPT-5.4-xhigh overlap | Claude-Opus-4.6 overlap | Gemini-3.1-Pro overlap |
|---|---|---|---|---|
| 7,400 | 99% | 99% | 99% | 99% |
| 1,000 | 57% | 15% | 29% | 18% |
When the candidate pool is still very large, all four models can maintain broad topical coverage, simply because a 7,400-entry draft inevitably contains nearly every entry a human indexer could plausibly select. But as the length budget tightens, the models diverge sharply. The issue is no longer whether a system can generate or retain many plausible candidate entries; it is whether an AI system can continue to preserve human-salient material when forced to make difficult exclusion decisions under a realistic size constraint.
As indexer Stephen Ullstrom observes,
> For all the differences—and each indexer did have their own approach—the overlap was also striking… even though two indexers will write different indexes, there should still be significant overlap because the text itself remains the same. Everything that is in the book still needs to somehow be reflected in the index.
We believe that the point of this benchmark is not that a machine-generated index must reproduce a human index term for term, nor that professionally acceptable indexes should be expected to converge completely. Rather, the point is that if indexing judgment is constrained by the text and not merely idiosyncratic, then a substantial core of human-salient topics ought to remain visible across different indexing attempts. A pruning system that loses that shared core too quickly under realistic space pressure is not merely producing a different index; it is failing to preserve the same central topics that a professional human indexer judged worth retaining.
On that point, the present backtest is suggestive. Once the candidate pool is compressed to roughly the scale of a real book index, IndexLM-1.0 retains a substantially larger share of the human topical core than the comparison models. This does not establish that the final 1,000-entry draft is identical to the printed index, nor that overlap alone is sufficient to measure full index quality. It does, however, indicate that a purpose-trained indexing model can make exclusion decisions that track professional human priorities considerably more closely than current general-purpose models.
Near-total convergence with the printed human index would not necessarily be the most persuasive result. At human-scale length, overlap in the range of 80 to 100 percent would itself invite scrutiny, since it could suggest not simply strong editorial judgment but excessive imitation of a particular target index. That concern is especially relevant in the case of statistical models trained on large numbers of preexisting indexes. For that reason, the aim of this backtest is not to ask whether the model can reproduce the judge’s index, but whether it can generate an independently defensible index that still preserves a substantial portion of the same human-salient topical core.
From that perspective, IndexLM-1.0’s result is more meaningful than a near-identity result would have been. The model converges with the printed index often enough to suggest substantial agreement about what the book is centrally about, yet not so completely as to imply rote reproduction. At the 1,000-entry checkpoint, the approximately 57% overlap suggests that the model and the human indexer agree on a majority of top-level topics, while still leaving considerable room for legitimate divergence in phrasing, emphasis, grouping, and omission. By contrast, much lower overlap rates, such as those observed for the comparison models, suggest not merely stylistic variation but a markedly weaker preservation of the human-selected topical core.
A final disclosure is therefore important. IndexLM-1.0 was fine-tuned on roughly 1,000 indexes and books drawn from the open internet, but The Oxford History of the French Revolution and its index were specifically excluded from the training data. The point of the present result is thus not that the model reproduced a memorized target, but that it arrived at substantial, though far from total, agreement with a professional human indexer on unseen material. That combination of convergence and non-identity is arguably closer to what one would expect from independent indexing judgment than either near-total overlap or very low overlap would be.
We chose this particular test for LLM systems because pruning is not external to indexing proper. Deciding what to remove from an oversized candidate set is one of the clearest places where indexing judgment becomes visible: it requires repeated decisions about significance, redundancy, proportionality, and selection under severe spatial constraint. In that sense, the present experiment isolates an important component of indexing skill and tests whether a model can preserve human-salient material while reducing a highly overgenerated draft to usable scale.
The broader implication is methodological. Discussions of AI indexing often focus on whether a model can generate a finished index from scratch. Yet indexing is not only generative; it is also selective, reductive, and editorial. A strong index is produced not simply by naming many plausible topics, but by deciding which topics do not merit independent treatment, which should be merged, and which can be omitted altogether. On this narrower but professionally important question, the present results suggest that purpose-trained models may be capable of acquiring something closer to genuine indexing judgment.
Future work will need to extend this analysis beyond top-level topic retention to locators, subentry architecture, cross-references, and overall reader usability. Even so, this backtest already supports a more limited but important conclusion: the central difficulty in automated subject indexing is not merely candidate generation, but disciplined selection, and that difficulty appears more tractable for specialized indexing models than for general-purpose language models.