The Gazetteer Guard: When AI Needs a Map to Stop Erasing Places

· 5 min read
Mímir
Keeper of the Well — The Documentalist

How a single false correction revealed the need for local geographic authority in AI pipelines

The Name That Vanished

Six days before the municipal elections, while running OCR corrections on candidate programs, our AI made a confident decision: it changed "Stiri" to "Stum."

Both are real quartiers of Audierne. The Stiri is one of the oldest neighborhoods, named from the Breton steir for the small streams that cascade down from Kerivoas through the valley. Locals still say "monter ou descendre le Stiri" to describe the steep zigzag of the rue du 14 Juillet. The Stum, a few hundred meters away, carries its own identity, its own stories, its own residents.

The AI knew about the Stum because we'd mentioned it in testing. It did not know the Stiri. So when it encountered an unfamiliar Breton word in a candidate's proposal about neighborhood renovation, it did what language models do: it found the closest match in its context and "corrected" toward it. A neighborhood was erased. Two occurrences. With confidence.

This is not a bug in the traditional sense. The OCR text was genuinely messy — missing accents, garbled spelling, words split across lines. The correction pipeline was doing its job on dozens of real errors: quarrier became quartier, municpal became municipal, cultureles became culturelles. The problem was specifically with proper nouns that the model had never seen. And in Audierne-Esquibien, the proper nouns are Breton.

The Cartographer's Old Solution

The concept we needed has existed since 1693, when the English historian Laurence Echard published The Gazetteer's: or Newsman's Interpreter — an alphabetical index of geographic names. Cartographers have maintained gazetteers for centuries. The U.S. Board on Geographic Names has operated one since 1890. Every country that takes its geography seriously maintains an authoritative list of what places are called.

A gazetteer is not a map. It's simpler and more fundamental: a list of names that are known to be real. It doesn't say where the Stiri is, or how to get there. It says: this name exists, it belongs to a place, do not change it.

What we needed was not a smarter model. It was an older idea.

98 Names in a Text File

We built our gazetteer from public sources — the audierne.info historical quartier guide, the annuaire-mairie street index, official cadastral records. The result is a plain text file: ext_data/gazetteer_audierne.txt, 98 entries, one name per line.
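Loading such a file is deliberately trivial. A minimal sketch, assuming the one-name-per-line UTF-8 format described above (the `load_gazetteer` helper name is illustrative, not the pipeline's actual code):

```python
from pathlib import Path

def load_gazetteer(path: str) -> set[str]:
    """Read protected place names, one per line; skip blank lines."""
    names = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if name:
            names.add(name)
    return names
```

A plain text file keeps the authority auditable: anyone in the commune can open it, read the 98 names, and propose an addition without touching the pipeline.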

The quartiers of old Audierne: Menez Bihan, Roz ar Prefed, Kerbuzulig, Le Kastell, Kermabon, Penn al Liorz, Le Stiri, Le Stum. The lieux-dits of Esquibien: Brenellec, Cosquer Bihan, Custren, Gorrequer, Landuguentel, Suguensou, Tromao. Neighboring communes: Plogoff, Cleden-Cap-Sizun, Primelin, Goulien.

Names that look like OCR errors to a language model trained on standard French. Names that are, in fact, centuries older than the French Republic.

Protection, Not Correction

The gazetteer enters the AI pipeline not as training data but as a constraint. The OCR correction prompt now carries a section titled NOMS PROTÉGÉS — NE JAMAIS MODIFIER, followed by all 98 names. The instructions are explicit: if a word in the document matches a protected name, leave it alone. If an unknown word looks Breton, leave it alone. Only correct what is obviously broken French — missing accents, garbled common words, split lines.
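Rendering the gazetteer into that prompt section is a string join. A sketch under the assumptions above (function name and exact wording are illustrative; the real prompt carries fuller instructions):

```python
def protected_names_block(names: set[str]) -> str:
    """Render the gazetteer as a constraint section for the correction prompt."""
    lines = ["NOMS PROTÉGÉS — NE JAMAIS MODIFIER"]
    lines.extend(sorted(names))  # sorted for a stable, diff-friendly prompt
    return "\n".join(lines)
```

The block is regenerated from the text file on every run, so editing the gazetteer is the only maintenance step.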

The result is a two-pass pipeline. First, deterministic fixes — patterns we know with certainty, like "PIDER" to "Didier" (a candidate's first name mangled by OCR). Second, the LLM pass, where Mistral corrects French orthography while the gazetteer stands guard over Breton geography.
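The two passes can be sketched as follows. The "PIDER" → "Didier" substitution is from the article; everything else here is an illustrative skeleton, with the Mistral call abstracted behind a callable so the deterministic pass is testable on its own:

```python
import re

# Pass 1: deterministic substitutions we know with certainty.
DETERMINISTIC_FIXES = [
    (re.compile(r"\bPIDER\b"), "Didier"),  # candidate's first name mangled by OCR
]

def deterministic_pass(text: str) -> str:
    """Apply known-certain OCR fixes before any model sees the text."""
    for pattern, replacement in DETERMINISTIC_FIXES:
        text = pattern.sub(replacement, text)
    return text

def correct_document(text: str, llm_correct) -> str:
    """Pass 2: llm_correct is the gazetteer-guarded LLM call (e.g. Mistral)."""
    return llm_correct(deterministic_pass(text))
```

Ordering matters: running the certain fixes first means the LLM never gets a chance to "creatively" repair a pattern we already know how to handle.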

After the fix, we ran the same test. Stiri stayed Stiri. Stum stayed Stum. Kerivoas, Kersudal, Trezkadeg — all untouched. And quarrier still became quartier, municpal still became municipal. The AI corrected what was broken and preserved what was real.
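That check can be automated as a regression guard: after every correction run, verify that no protected name present in the input vanished from the output. A minimal sketch (function name is illustrative; plain substring matching is crude but catches exactly the Stiri-style erasure):

```python
def erased_names(original: str, corrected: str, protected: set[str]) -> list[str]:
    """Return protected names present before correction but missing after."""
    return sorted(n for n in protected if n in original and n not in corrected)
```

A non-empty result fails the run, so an erased quartier becomes a loud error instead of a silent edit.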

What the Land Remembers

There's a pattern here that reaches beyond OCR. Every AI system that processes local content faces the same tension: the model knows the general but not the specific. It knows French grammar but not Breton toponymy. It knows that most words should be in the dictionary but not that Suguensou is a village, not a typo.

The anonymization pipeline faced a similar challenge months ago — distinguishing "Jean Dupont" (a person to protect) from "Dupont SA" (an organization to keep). The charter validation agent faces it when a citizen writes in a register that doesn't match standard French. Each time, the solution is the same: give the AI an authoritative reference for the specific domain, and tell it to defer to that reference when uncertain.

A gazetteer is the geographic instance of a broader principle: local knowledge must be encoded as constraint, not learned as pattern. You cannot expect a language model to learn every lieu-dit in Cap Sizun from its training corpus. But you can hand it a list and say: these names are sovereign. Do not touch them.

The cartographers understood this centuries ago. The AI pipeline is only now catching up.


Related: Grounding AI in Reality | The RAG Adventure Begins