RAG Pipeline — Status & Evaluation
Last updated: 2026-03-10
Architecture
```
User Question
      │
      ▼
┌──────────────┐   gpt-4o-mini    ┌──────────────────┐
│ QueryRefiner │ ───────────────► │ RefineResult     │
│  (Layer 1)   │                  │ query, category, │
│              │                  │ list_code,       │
└──────────────┘                  │ corrections      │
      │                           └──────────────────┘
      ▼
┌──────────────┐   all-MiniLM     ┌────────────────────┐
│   ChromaDB   │ ───────────────► │ RetrievalResult[]  │
│  (Layer 2)   │      L6-v2       │ content, metadata, │
│              │                  │ distance           │
└──────────────┘                  └────────────────────┘
      │
      ▼
┌──────────────┐    Mistral       ┌──────────────────┐
│  Synthesis   │    medium        │ ChatResult /     │
│  (Layer 3)   │ ───────────────► │ CompareResult    │
│              │                  │ + retrieval      │
└──────────────┘                  │ metrics          │
                                  └──────────────────┘
```
Provider: Mistral medium-latest (default), failover chain: mistral → ollama → openai → claude → gemini
Tracing: All traces logged to Opik project ocapistaine-test
Corpus
| Source | Slug | Chunks | Documents | Type |
|---|---|---|---|---|
| Programme co-construit | audierne2026 | 280 | 51 | README, contributions, PDF extracts |
| Passons à l'Action ! | paa | 55 | 24 | OCR (Mistral Document AI) |
| Construire l'Avenir | ca | 31 | 18 | OCR |
| S'unir pour Audierne-Esquibien | spae | 27 | 16 | OCR |
| Cap sur Notre Futur | csnf | 6 | 6 | OCR |
| Municipal context | (empty) | 112 | 56 | Deliberations, council reports |
| Total | | 511 | 171 | |
Known gaps:
- CSNF has only 6 chunks (1%) — responses for Bosser's list will be thin
- 67 documents have an empty `category` field (all OCR programme chunks)
- Electoral lists overview added 2026-03-10 (reference document from `ext_data/LISTS.md`)
Ingestion
- JSONL source: `data/audierne2026/rag/documents.jsonl` (171 docs)
- Rebuild: `python scripts/rebuild_programs_jsonl.py --apply` (reads from `/dev/audierne/docs/programmes/`)
- Ingest: `python -m app.rag.ingest --reset` (chunks at 1500 chars, 200 overlap)
- Embedding: `all-MiniLM-L6-v2` (384-dim, English-centric; adequate for French but not optimal)
- Store: ChromaDB persistent at `data/chromadb/`, collection `ocapistaine_docs`
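As a rough illustration of these chunking parameters, a sliding-window splitter with a 1500-char window and 200-char overlap looks like the sketch below; the actual `app.rag.ingest` implementation is not reproduced here, so treat this as illustrative only.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with a sliding-window overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # advance 1300 chars per chunk by default
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks

parts = chunk_text("x" * 3000)
print([len(p) for p in parts])  # [1500, 1500, 400]
```

Each chunk repeats the last 200 chars of its predecessor, so a sentence cut at a boundary still appears whole in one of the two chunks.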
Query Refinement (Layer 1)
Pre-processes user input via OpenAI gpt-4o-mini before retrieval.
Four tasks in one LLM call:
- Spelling/grammar correction (accent restoration, proper-case names)
- Query reformulation (vague → precise for retrieval)
- Category detection (maps to 7 thematic categories)
- List detection (maps candidate names to list codes: ca, paa, spae, csnf)
Name gazetteer: Auto-loaded from ext_data/audierne2026/programmes/ colistier files. When a candidate is mentioned, the query is enriched with the full list name for better retrieval context.
Example: "Que propose Bosser ?" → "Que propose Eric Bosser (Cap sur Notre Futur) ?" + list_code: csnf
Code: app/agents/ocapistaine/features/refine.py
Retrieval Metrics (Layer 2)
Every retrieval logs structured metrics to Opik:
| Metric | Meaning | Good | Weak |
|---|---|---|---|
| `best_distance` | Closest chunk to query | < 0.3 | > 0.5 |
| `mean_distance` | Average across all chunks | < 0.4 | > 0.5 |
| `distance_spread` | max - min distance | < 0.2 | > 0.3 |
| `distance_gap_1_2` | Gap between #1 and #2 | > 0.05 | < 0.01 |
| `unique_lists` | Distinct electoral lists | 4+ (overview) | 1 |
| `above_threshold_count` | Chunks above relevance threshold | > 5 | 0-2 |
Confidence formula: `confidence = 1 - best_distance` (simple, effective)
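Given the list of chunk distances returned by ChromaDB (lower = closer), the metrics above reduce to a few lines. A minimal sketch; the 0.5 threshold is an assumption for illustration, and `unique_lists` is omitted because it needs chunk metadata, not just distances:

```python
def retrieval_metrics(distances: list[float], threshold: float = 0.5) -> dict:
    """Derive the Layer-2 metrics from chunk distances (lower = closer)."""
    d = sorted(distances)
    return {
        "best_distance": d[0],
        "mean_distance": sum(d) / len(d),
        "distance_spread": d[-1] - d[0],
        "distance_gap_1_2": d[1] - d[0] if len(d) > 1 else 0.0,
        # "above threshold" = more relevant than the cutoff, i.e. smaller distance
        "above_threshold_count": sum(1 for x in d if x < threshold),
        "confidence": 1 - d[0],  # the simple confidence formula
    }

m = retrieval_metrics([0.30, 0.35, 0.42, 0.55])
print(round(m["confidence"], 3), m["above_threshold_count"])  # 0.7 3
```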
Evaluation — Baseline Report (2026-03-10)
Trace Inventory
Source: 142 traces from Opik project ocapistaine-test (production usage since launch).
| Trace Type | Count | Has Metrics |
|---|---|---|
| `rag_chat` | ~90 | Yes (confidence, retrieval) |
| `rag_overview` | ~20 | Yes |
| `rag_compare` | ~20 | Partial (no confidence for some) |
| `mockup_query_refine` | ~30 | No (test noise, filtered out) |
Confidence Distribution (111 traces with metrics)
| Range | Count | Assessment |
|---|---|---|
| >= 0.70 | 2 | Strong retrieval |
| 0.60 – 0.70 | 19 | Adequate |
| 0.50 – 0.60 | 59 | Marginal |
| < 0.50 | 31 | Weak |
- Mean: 0.544
- Median: 0.544
- Min: 0.355 ("Jeunesse", a single-word compare)
- Max: 0.728 ("que propose les listes pour pierre le lec ?")
33% of queries were refined by the QueryRefiner (37/111).
Conversation Threads
84 threads total, 13 multi-turn (2-5 turns each).
Observed degradation pattern: In multi-turn threads, confidence tends to drop on follow-up questions. The refine step resolves pronouns ("Et lardic ?", "Est il plutot a droite ?") via conversation history, but retrieval operates on the refined query alone without benefiting from accumulated context.
Top Improvement Candidates
| Thread | Worst Conf | Issue |
|---|---|---|
| 9f526748 | 0.38 | "Qui es tu?" — no self-identity chunk |
| d7b37b26 | 0.47 | "Qui est Marc Arzel?" — person not in corpus |
| 49714150 | 0.45 | Jeunesse queries — low list diversity |
| 8d4b3d3e | 0.36 | Single-word "Jeunesse" in compare mode |
| 51589cbb | 0.47 | "ou en est la campagne ?" — out of scope |
Failure Pattern Analysis
| Pattern | Frequency | Root Cause | Fix |
|---|---|---|---|
| `best_distance > 0.5` | 31 traces | Corpus gap or vague query | Add reference docs, improve refine |
| `unique_lists = 1` | ~10 traces | List filter too narrow | Check list detection logic |
| Identity questions | 3 traces | No "about" document | Add self-description chunk |
| Campaign status | 2 traces | Out-of-scope for programme RAG | Detect and respond gracefully |
| Person lookup | 3 traces | Names scattered across chunks | Added overview doc with all heads |
Fix Applied: Electoral Lists Overview (2026-03-10)
Problem: "Quelles sont les listes ?" returned best_distance=0.502 (weak). No single chunk contained a structured overview of the four lists with their heads.
Fix: Added ext_data/LISTS.md as a reference document in the JSONL. Contains all four lists, heads, political nuance, slug mappings.
Result: The overview document now ranks #2 at distance=0.406 for list-related queries, containing all four heads of list (Lardic, Guillon, Van Praet, Bosser).
Experiment Framework
Datasets (Opik)
Two evaluation datasets were built from production traces on 2026-03-10:
| Dataset | Items | Purpose |
|---|---|---|
| `rag-standalone-20260310` | 63 | Turn-1 queries from each thread (independent, replayable) |
| `rag-threaded-20260310` | 79 | Follow-up turns with stored conversation history |
Standalone dataset: Each item contains the original question, baseline confidence, retrieval metrics, and the source trace ID. Can be replayed independently to measure improvement after pipeline changes.
Threaded dataset: Each item contains the follow-up question plus the actual prior conversation (user questions + assistant responses from the original session). This solves the compounding-error problem: we test refine + retrieval with real history, not re-generated responses.
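Illustrative item shapes for the two datasets; the field names and values below are assumptions for the sketch, not the exact Opik schema:

```python
# A standalone item: an independently replayable turn-1 query plus its baseline.
standalone_item = {
    "question": "Quelles sont les listes ?",
    "baseline_confidence": 0.498,                 # illustrative value
    "baseline_metrics": {"best_distance": 0.502, "unique_lists": 4},
    "source_trace_id": "trace-123",               # hypothetical ID
}

# A threaded item also carries the real prior conversation, so the refine step
# is evaluated against genuine history instead of re-generated responses.
threaded_item = {
    "question": "Et lardic ?",
    "history": [
        {"role": "user", "content": "Que propose Bosser ?"},
        {"role": "assistant", "content": "(stored original answer)"},
    ],
    "baseline_confidence": 0.47,                  # illustrative value
}
```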
Experiment Design
```
┌─────────────────────────────────────────────────┐
│ Standalone Experiment                           │
│                                                 │
│ For each item:                                  │
│   1. Run question through full RAG pipeline     │
│   2. Record new confidence, retrieval metrics   │
│   3. Compare with baseline from original trace  │
│   4. Score: confidence_improvement (delta)      │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│ Threaded Experiment                             │
│                                                 │
│ For each follow-up item:                        │
│   1. Feed stored history to QueryRefiner        │
│   2. Run refined query through retrieval        │
│   3. Compare with baseline                      │
│   4. Score: context_resolution (did refine      │
│      correctly resolve pronouns/references?)    │
└─────────────────────────────────────────────────┘
```
Custom Metrics
| Metric | Formula | Measures |
|---|---|---|
| `confidence_improvement` | `new_confidence - baseline_confidence` | Did the pipeline change help? |
| `context_resolution` | Refine output matches expected expansion | Does history-aware refine work? |
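Both metrics can be sketched as plain scoring functions, independent of Opik's metric interface; the substring-based expansion check below is an assumption about how "matches expected expansion" is judged:

```python
def confidence_improvement(baseline: float, new: float) -> float:
    """Positive delta means the pipeline change helped this query."""
    return new - baseline

def context_resolution(refined_query: str, expected_terms: list[str]) -> float:
    """Fraction of expected expansions (resolved names/references) that
    actually appear in the refined query, e.g. ['Lardic'] for 'Et lardic ?'."""
    if not expected_terms:
        return 1.0
    hits = sum(1 for t in expected_terms if t.lower() in refined_query.lower())
    return hits / len(expected_terms)

delta = confidence_improvement(0.498, 0.594)  # > 0 means improvement
score = context_resolution("Que propose la liste de Lardic ?", ["Lardic"])  # 1.0
```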
First Experiment: rag-after-overview-fix
- Date: 2026-03-10
- Change: Added electoral lists overview document to corpus
- Dataset: `rag-standalone-20260310` (63 queries)
- Provider: Mistral medium-latest
- Status: Task function completed (all 63 queries re-processed). Metric scoring had a serialization bug (output passed as string, not dict) — to fix in next iteration.
Running Experiments
```bash
# Via existing task framework
python -c "
from app.services.tasks import task_opik_evaluate
task_opik_evaluate(experiment_type='rag_chat_evaluation')
"
```

```python
# Or via workflow_experiment directly
from app.processors.workflows import run_opik_experiment, OpikExperimentConfig

config = OpikExperimentConfig(
    experiment_name='rag-my-experiment',
    dataset_name='rag-standalone-20260310-230639',
    experiment_type='rag_chat_evaluation',
    metrics=['answer_relevance'],
    task_provider='mistral',
)
result = run_opik_experiment(config)
```
Diagnostic Workflow
When a RAG response is wrong or incomplete, follow the Kvasir diagnostic:
- Get the trace: `client.get_trace_content(trace_id)` from project `ocapistaine-test`
- Check refine span: Was the query improved? Category/list detected?
- Check retrieval span: `best_distance`, `unique_lists`, `chunks_found`
- Reproduce locally: `from app.rag.retrieval import search; search(query, n_results=10)`
- Check corpus: Search ChromaDB for expected keywords
- Diagnose: Use the four-layer model (Refine → Retrieval → Metrics → Ingestion)
- Fix cheapest first: query refinement (free) → metadata fix (free) → threshold (free) → add documents (cheap) → re-chunking (medium)
See references/opik-trace-diagnosis.md in the Kvasir skill for the full procedure.
Roadmap
| Priority | Item | Cost | Impact |
|---|---|---|---|
| P0 | Fix metric serialization bug | Free | Unblocks experiment scoring |
| P0 | Run threaded experiment | Low | Validates context resolution |
| P1 | Add self-identity document ("Qui es tu?") | Free | Fixes 3 failing traces |
| P1 | Fill empty category fields for OCR chunks | Medium | Improves category filtering |
| P1 | Add more CSNF documents (only 6 chunks) | Medium | Balances list representation |
| P2 | French-optimized embeddings (camembert/multilingual-e5) | Higher | Better distance scores for French |
| P2 | Hybrid search (vector + keyword BM25) | Higher | Catches exact-match queries |
| P2 | Re-ranking with cross-encoder | Higher | Better top-k precision |