The RAG Adventure Begins: When Documents Learn to Speak
How 60 social media screenshots became a searchable civic memory in one afternoon
The Question Nobody Could Answer
Six days before the municipal elections in Audierne-Esquibien, a citizen asks: "What do the four lists propose for the local economy?"
Simple question. Impossible answer — until today.
The proposals existed, scattered across Facebook posts, PDF flyers slipped under doors, Instagram stories, and one carefully co-constructed participatory program on audierne2026.fr. No single human had read everything from every list. No journalist had time to cross-reference. The information was public but effectively invisible, buried under the noise of campaign season.
This is the gap OCapistaine was built to fill. Not to judge. Not to recommend. To remember — and to help citizens compare.
From Crawling to Comprehension
For two months, OCapistaine had been a builder without a voice. It could crawl municipal documents, validate citizen contributions against a charter, anonymize personal data, trace every LLM call through Opik. It had a robust provider failover chain, a scheduler, a Streamlit interface, Redis coordination. What it could not do was the one thing citizens actually needed: answer a question.
The missing piece was retrieval. Not keyword search — semantic understanding. A citizen asking about "commerce local" should find a candidate's proposal about "dynamiser le centre-bourg" even though the words don't overlap. This is what vector search does: it maps meaning into geometry, where proximity is relevance.
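The intuition can be shown with a toy example: if related phrases map to nearby vectors, ranking by cosine similarity surfaces them even when no keywords overlap. The three-dimensional vectors below are invented purely for illustration; real embedding models produce hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-d "embeddings": related topics get nearby vectors,
# unrelated topics get distant ones.
vectors = {
    "commerce local": [0.9, 0.8, 0.1],
    "dynamiser le centre-bourg": [0.85, 0.75, 0.2],  # related, nearby
    "horaires de la déchetterie": [0.1, 0.2, 0.9],   # unrelated, distant
}

query = vectors["commerce local"]
ranked = sorted(vectors, key=lambda k: cosine(query, vectors[k]), reverse=True)
```

Here `ranked` puts "dynamiser le centre-bourg" right behind the query itself, despite sharing no words with it: proximity is relevance.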
The Architecture of Memory
The pipeline we built today is deliberately minimal. Elections are in six days — elegance matters less than existence.
ChromaDB stores the vectors locally. No cloud dependency, no API key for retrieval, no latency penalty. The entire civic corpus fits in memory: 164 documents, 463 chunks, under 500MB with the embedding model.
The embedding model is ChromaDB's built-in all-MiniLM-L6-v2 running via ONNX — no GPU required, no Python 3.13 compatibility headaches, no heavyweight dependencies. It handles French adequately at this scale. Perfection is the enemy of the shipped.
Mistral OCR was the unexpected hero. Four electoral lists had published their programs as images on social media — photographs of printed flyers, screenshots of Facebook posts, scanned PDFs. Sixty files total. Mistral's Document AI processed every one without error, extracting text from JPGs and PDFs alike, outputting clean markdown. Two minutes of API calls turned visual noise into searchable knowledge.
Five Voices in One Index
The vector store now holds five distinct voices:
| Source | Documents | What it contains |
|---|---|---|
| Audierne-Esquibien 2026 | 103 | Co-constructed participatory program + citizen contributions |
| Construire l'Avenir (LDVG) | 17 | Campaign proposals + candidate presentations |
| Passons à l'Action ! (LDVD) | 24 | Editorial positions + candidate profiles |
| S'unir pour Audierne-Esquibien (LDVG) | 15 | Campaign platform + team introductions |
| Cap sur Notre Futur (LDVD) | 5 | Program manifesto + candidate list |
Each document carries a `list_name` tag. When a citizen asks to compare programs, OCapistaine queries each list separately, gathers the most relevant chunks, and asks the LLM to synthesize a neutral, factual comparison. No editorializing. No ranking. Just: here is what each list says about your topic, with sources.
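The per-list retrieval step can be sketched with ChromaDB's metadata filter; the function name and result shape are illustrative assumptions, not the project's actual code.

```python
def compare_lists(collection, question, list_names, n_results=3):
    """Query each electoral list separately so every voice is represented,
    returning retrieved chunks (with their distances) grouped by list."""
    evidence = {}
    for name in list_names:
        hits = collection.query(
            query_texts=[question],
            n_results=n_results,
            where={"list_name": name},  # restrict retrieval to this list's documents
        )
        # ChromaDB returns lists-of-lists, one inner list per query text.
        evidence[name] = list(zip(hits["documents"][0], hits["distances"][0]))
    return evidence
```

The grouped evidence then goes to the LLM with an instruction to summarize each list's position neutrally, citing its sources; querying per list rather than once over the whole corpus keeps a well-documented list from drowning out a sparse one.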
Tracing Every Conversation
Every chat interaction creates an Opik trace with the Streamlit session ID as `thread_id`. Inside each trace, two spans tell the story: a retrieval span logging which chunks were found and at what semantic distance, and a synthesis span recording which model answered, how many tokens it used, and what it said.
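Structurally, logging one interaction could look like the sketch below. The client is injected so the sketch stays independent of any backend; it assumes Opik's client-style `trace()`/`span()` calls, and every field name here is illustrative.

```python
def log_chat_trace(client, session_id, question, chunks, answer, model):
    """Record one chat interaction: a trace per question, threaded by
    session, with one span for retrieval and one for synthesis."""
    trace = client.trace(
        name="civic_chat",
        thread_id=session_id,           # groups traces into one conversation
        input={"question": question},
    )
    trace.span(
        name="retrieval",
        input={"question": question},
        output={
            "chunks": [c["text"] for c in chunks],
            "distances": [c["distance"] for c in chunks],
        },
    )
    trace.span(
        name="synthesis",
        input={"model": model},
        output={"answer": answer},
    )
    trace.end()
    return trace
```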
This matters beyond debugging. After the elections, these traces become a dataset. Which questions did citizens actually ask? Which topics had the richest coverage? Where did retrieval fail — returning distant chunks that didn't really answer the question? This is the feedback loop that will make the next iteration better. Opik doesn't just observe — it remembers what the system learned about its own limits.
What Beta Means
This is a beta. The embedding model is generic, not fine-tuned for French civic vocabulary. The chunking is character-based, not sentence-aware. The comparison mode doesn't yet weight document relevance. Some OCR extracts contain image references that add noise.
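For reference, character-based chunking is as simple as it sounds: fixed-size windows with an overlap, so a sentence cut at one boundary survives at the start of the next chunk. The sizes below are illustrative, not the project's actual settings.

```python
def chunk_text(text, size=800, overlap=100):
    """Split text into fixed-size character windows with overlap.
    No sentence awareness: a chunk can start or end mid-sentence."""
    step = size - overlap  # must stay positive, or the loop never advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

A sentence-aware splitter would cut on punctuation instead, which is one of the planned refinements.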
But beta means something specific here: it works well enough to be useful. A citizen can open a Streamlit page, type a question in French about their town's future, and receive a sourced answer drawing from every candidate's published program. That didn't exist yesterday.
The Deeper Current
There's a philosophical tension at the heart of this project that the RAG pipeline makes visible. Democracy requires informed citizens. Information requires access. Access requires someone — or something — to gather, structure, and present what candidates have said.
Traditionally, this was the role of the local press, which in rural France is thinning. Regional dailies like Le Télégramme and Ouest-France still cover Audierne through local correspondents, and the municipal bulletin Le Gwaien keeps residents informed of administrative news. But no independent outlet tracks the specifics of each electoral program, compares promises across lists, or makes that comparison searchable after the campaign ends. What fills that gap today? Facebook groups where information mixes with opinion. Word of mouth. Campaign flyers that arrive once and are lost.
OCapistaine doesn't replace journalism. It replaces forgetting. It ensures that what was said publicly stays findable, comparable, and accountable — not just during the campaign, but after.
The RAG adventure begins. The ship has left the harbor. Now we see where the current takes us.
Related: When the Bottleneck Moves | Programme Constitution
