Clean the Windshield: Enhancing RAG Without Upgrading the Engine

· 8 minute read
Jean-Noël Schilling
Locki one / French maintainer

Three near-zero-cost interventions that improved our civic RAG pipeline more than any model change could

The Expensive Reflex

When a RAG pipeline underperforms, the instinct is to upgrade. Bigger embeddings. Longer context windows. A re-ranking model. Multi-query retrieval with hypothetical document embeddings. Each upgrade costs — in compute, in latency, in complexity, in the monthly invoice that arrives whether or not anyone asked a question that day.

We almost followed that path. Three days after launching OCapistaine's beta, we had traces in Opik showing retrieval quality all over the map. A citizen asks about "Van Praet et l'école" and the vector search looks for those exact semantics — missing that the candidate's name is actually "Van Praët" and that the relevant chunks use "établissements scolaires" instead of "école." The retrieval was technically working. It was the input that was broken.

The fix wasn't a better engine. It was a cleaner windshield.

Before Retrieval: The Cheapest LLM Call You'll Ever Make

A user typing on a phone at a market stall doesn't capitalize proper nouns. They abbreviate. They misspell Breton names and forget accents. They type "bosser ecole" and mean "What does candidate Bosser propose for the school?" — but the vector search dutifully looks for documents about working at school.

We added a pre-processing step: a single call to a cheap, fast model that does two things in one shot. First, it corrects wording — proper-casing candidate names, fixing accents, catching the kind of errors a human reader would silently resolve. Second, if the query is vague, it reformulates it into something a vector search can actually work with.

The key was injecting what the model doesn't know. We built a name gazetteer — 70 candidate names automatically extracted from the four electoral lists' published materials. These names travel inside the system prompt, so the model has an authoritative reference for correction. "van praet" becomes "Van Praët" not because the model guesses, but because it matches against a known list.

One LLM call. Fractions of a cent. Under 400 milliseconds. The query that hits the vector store is now clean, properly cased, and specific enough to retrieve what the citizen actually wanted.
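The shape of that step can be sketched as a prompt builder that injects the gazetteer. The wording and JSON schema below are illustrative stand-ins, not OCapistaine's production prompt (which lives in the Opik registry):

```python
def build_refiner_prompt(gazetteer: list[str]) -> str:
    """Assemble the refiner's system prompt with the candidate-name
    gazetteer injected, so corrections match an authoritative list
    instead of the model's guesses. Wording here is illustrative."""
    names = ", ".join(sorted(gazetteer))
    return (
        "You clean citizen queries before vector search.\n"
        "1. Fix casing, accents, and misspellings. Candidate names must "
        f"match this list exactly: {names}.\n"
        "2. If the query is vague, reformulate it into a specific, "
        "searchable question.\n"
        'Return JSON: {"corrected": ..., "refined": ..., "corrections": [...]}'
    )

prompt = build_refiner_prompt(["Van Praët", "Bosser"])
```

The cheap model then receives this system prompt plus the raw query, and returns the corrected and refined forms in one shot.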

The TRIZ principle at work is Prior Action: perform the cheap correction before the expensive retrieval, so the expensive step operates on clean input. The anonymization pipeline taught us this same lesson months ago — a fast, deterministic first pass before the heavy inference. The pattern keeps proving itself.

During Retrieval: Fourteen Metrics for Free

Here's what surprised us most: the vector search was already producing everything we needed to judge its own quality. We just weren't looking.

Every ChromaDB query returns not just the matching chunks but the distances — the semantic gap between the query and each result. These distances are pure gold. From them alone, with zero additional LLM calls, we compute fourteen metrics:

  • Best distance: how close was the nearest match? A proxy for whether the corpus contains relevant information at all.
  • Distance spread: the gap between the best and worst results. A tight spread means the results are uniformly relevant (or uniformly irrelevant). A wide spread means the search found one gem and nine noise chunks.
  • Distance gap between rank 1 and rank 2: when this is large, the top result dominates — the query had one clear answer. When it's small, multiple documents compete — the topic spans several sources.
  • Source diversity: unique documents, unique electoral lists, unique categories. A retrieval that pulls ten chunks from one document is less useful than one that pulls two chunks from five documents.
  • Density: what fraction of results falls below a relevance threshold? This separates "I found ten things, three of which matter" from "I found ten things, eight of which matter."
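The metrics above are a few lines of arithmetic over the ChromaDB result. In this sketch the metadata keys (`source`, `list`, `category`) and the relevance threshold are illustrative stand-ins for whatever the corpus actually uses:

```python
def retrieval_metrics(result: dict, relevance_threshold: float = 1.0) -> dict:
    """Compute retrieval-quality metrics from a ChromaDB query result.
    Pure arithmetic on data the search already produced; no LLM calls."""
    distances = sorted(result["distances"][0])   # first (only) query, best first
    metas = result["metadatas"][0]
    below = [d for d in distances if d < relevance_threshold]
    return {
        "best_distance": distances[0],
        "distance_spread": distances[-1] - distances[0],
        "rank1_rank2_gap": distances[1] - distances[0] if len(distances) > 1 else 0.0,
        "unique_documents": len({m.get("source") for m in metas}),
        "unique_lists": len({m.get("list") for m in metas}),
        "unique_categories": len({m.get("category") for m in metas}),
        "density": len(below) / len(distances),
    }
```
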

None of this requires a model. It's arithmetic on data the search already produced. The metrics attach to every Opik trace as structured output on the retrieval span, and three of them — confidence, diversity, density — are logged as filterable feedback scores.

Now, in the Opik UI, we can filter all traces where retrieval.density fell below 0.3 and see: what did the citizen ask? Were the chunks actually irrelevant, or was the threshold too aggressive? Was it a query quality issue (which the refiner should have caught) or a corpus gap (which means we're missing documents)?

The metrics cost nothing but illuminate everything.

After Retrieval: Making the Prompts Manageable

The third intervention is structural, not algorithmic. Every prompt in OCapistaine's pipeline — the refiner's system prompt, the chat synthesis prompt, the comparison prompt, the overview prompt — is stored as JSON, synced to Opik's prompt library, and loadable from there at runtime. The code holds a hardcoded fallback, but the live version comes from a registry that Opik can version, track, and attach to experiments.

Why does this matter for RAG quality? Because prompt wording is retrieval's silent partner. The synthesis prompt that tells the model "cite your sources" produces better answers than one that doesn't — but you can only discover this through A/B testing. The refiner prompt that says "if the question mentions a person, search for their name in all documents" changes retrieval behavior even though it touches no retrieval code.

Prompt sync makes these variations testable. Push a new version to Opik. Run an experiment against a dataset of real citizen questions. Compare the scores. Roll back if it regresses. The prompt becomes code — versioned, diffable, evaluable.
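The registry-with-fallback pattern can be sketched as below. The `Opik().get_prompt` call is a best guess at the Opik Python SDK's API, and the fallback text is a placeholder:

```python
FALLBACK_PROMPTS = {
    # Hardcoded fallbacks; the live versions come from the Opik registry.
    "ocapistaine-query-refine": "Correct and refine the citizen query before search.",
}

def load_prompt(name: str) -> str:
    """Load a prompt from Opik's prompt library, falling back to the
    hardcoded copy when the SDK or registry is unavailable."""
    try:
        from opik import Opik                    # optional dependency
        prompt = Opik().get_prompt(name=name)    # assumed SDK call
        if prompt is not None:
            return prompt.prompt                 # latest registered version
    except Exception:
        pass                                     # missing SDK / network -> fallback
    return FALLBACK_PROMPTS[name]
```

The hardcoded dictionary keeps the pipeline functional offline, while any prompt pushed to the registry takes precedence at runtime.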

The composite prompt ocapistaine-query-refine combines the refiner's system prompt with a user template and lives in the Opik playground. An evaluator can type a misspelled query, see the JSON correction output, and judge: did it fix the right things? Did it preserve meaning? Did it over-correct? This is prompt development as a first-class workflow, not a hidden string in source code.

Separate Spans for Separate Judgments

One decision that shaped the whole evaluation strategy: we log wording correction and semantic refinement as separate Opik spans, even though they come from the same LLM call.

The query_wording span fires when the refiner corrects spelling or names. Its output is the corrections list: ["van praet → Van Praët", "ecole → école"]. This span can be judged on precision (did it change things that were actually wrong?) and recall (did it catch everything?).
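A precision check for that span is mechanical: every corrected name should exist in the gazetteer. This sketch assumes the `before → after` format shown above and, for simplicity, treats capitalized outputs as name corrections:

```python
def corrections_match_gazetteer(corrections: list[str], gazetteer: set[str]) -> bool:
    """Precision check for the query_wording span: every name correction
    must resolve to a gazetteer entry. Accent fixes on common words
    (lowercase outputs) are skipped in this simplified heuristic."""
    for entry in corrections:
        _, after = entry.split(" → ")
        if after[0].isupper() and after not in gazetteer:
            return False
    return True
```
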

The query_refine span fires when the refiner reformulates a vague query. Its output is the expanded question. This span is judged differently: did the reformulation improve retrieval quality? You can A/B test this by running the original and refined queries against the vector store and comparing mean distances.
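That A/B comparison reduces to one inequality. Here `search` is a stand-in for whatever wraps the vector store query and returns its distances:

```python
from statistics import mean
from typing import Callable

def refinement_wins(original: str, refined: str,
                    search: Callable[[str], list[float]]) -> bool:
    """A/B check for the query_refine span: does the refined query
    retrieve closer chunks (lower mean distance) than the original?"""
    return mean(search(refined)) < mean(search(original))
```
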

Same LLM call, same cost, two independently evaluable dimensions. The evaluation criteria we designed for each are concrete enough to become automated Opik scorers:

For wording: did all corrected names match the gazetteer? Were any correct words wrongly changed? Was meaning preserved?

For refinement: is the refined query between 1.5x and 5x the original length? Does it contain domain-specific keywords that the original lacked? Does retrieval with the refined query produce lower mean distance than the original?

For retrieval: is confidence above 0.7? Is diversity above 0.5? Is density above 0.6? These thresholds are starting points — the real thresholds will emerge from correlating automated metrics with citizen feedback (the thumbs up and thumbs down we already collect).
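As a scorer, those retrieval thresholds are one dictionary comprehension; the resulting pass/fail values would be logged to Opik as feedback scores on the retrieval span:

```python
# Starting-point thresholds from the text, pending empirical validation.
THRESHOLDS = {"confidence": 0.7, "diversity": 0.5, "density": 0.6}

def score_retrieval(metrics: dict) -> dict:
    """Turn retrieval metrics into binary feedback scores (1.0 = pass)."""
    return {k: float(metrics.get(k, 0.0) >= t) for k, t in THRESHOLDS.items()}
```
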

The Peripheral Insight

The three interventions share a common pattern: none of them touched the core model. The embedding model is the same generic all-MiniLM-L6-v2. The synthesis model is whatever the failover chain provides — Ollama, OpenAI, Claude, Gemini. The vector database is the same local ChromaDB. Nothing at the center changed.

What changed was the periphery. The input to the search. The measurement of the search. The management of the prompts around the search. Like cleaning a windshield, adjusting the mirrors, and reading the map before driving — the car doesn't go faster, but you stop missing the turns.

This is an under-discussed aspect of RAG engineering. The literature focuses on retrieval architectures: dense vs. sparse, single-vector vs. multi-vector, vanilla vs. ColBERT vs. late interaction. These matter. But for a civic project running on donated compute and volunteer time, the peripheral improvements delivered more ROI per euro than any architectural change would have.

The refiner cost: fractions of a cent per query for a pre-existing model. The metrics cost: zero — arithmetic on data already returned. The prompt sync cost: developer time to register JSON files in a pipeline that already existed.

Total infrastructure added: one import line and a dictionary entry.

What Remains

The evaluation criteria exist on paper but not yet as automated scorers. The thresholds (0.7 confidence, 0.5 diversity, 0.6 density) are informed guesses waiting for empirical validation against citizen feedback. The refiner works but hasn't been A/B tested against raw queries at scale.

These are the right problems to have. They are measurement problems, not capability problems. The pipeline produces the data; the analysis catches up when time allows.

The deeper lesson is one that keeps recurring in this project: the most impactful improvements are often the most boring. Not a new model. Not a new architecture. A text file with 70 names. Fourteen lines of arithmetic. A JSON registry for prompts that were already written.

The windshield was dirty. We cleaned it. The road is clearer now.


Related: The RAG Adventure Begins | The Gazetteer Guard | The Conversation Loop