RAG Pipeline — Status & Evaluation

Last updated: 2026-03-10

Architecture

User Question
      │
      ▼
┌──────────────┐   gpt-4o-mini    ┌──────────────────┐
│ QueryRefiner │ ───────────────► │ RefineResult     │
│  (Layer 1)   │                  │ query, category, │
│              │                  │ list_code,       │
└──────────────┘                  │ corrections      │
      │                           └──────────────────┘
      ▼
┌──────────────┐   all-MiniLM-    ┌────────────────────┐
│   ChromaDB   │ ───────────────► │ RetrievalResult[]  │
│  (Layer 2)   │      L6-v2       │ content, metadata, │
│              │                  │ distance           │
└──────────────┘                  └────────────────────┘
      │
      ▼
┌──────────────┐     Mistral      ┌──────────────────┐
│  Synthesis   │     medium       │ ChatResult /     │
│  (Layer 3)   │ ───────────────► │ CompareResult    │
│              │                  │ + retrieval      │
└──────────────┘                  │ metrics          │
                                  └──────────────────┘

Provider: Mistral medium-latest (default), failover chain: mistral → ollama → openai → claude → gemini

Tracing: All traces logged to Opik project ocapistaine-test

Corpus

| Source | Slug | Chunks | Documents | Type |
|---|---|---:|---:|---|
| Programme co-construit | audierne2026 | 280 | 51 | README, contributions, PDF extracts |
| Passons à l'Action ! | paa | 55 | 24 | OCR (Mistral Document AI) |
| Construire l'Avenir | ca | 31 | 18 | OCR |
| S'unir pour Audierne-Esquibien | spae | 27 | 16 | OCR |
| Cap sur Notre Futur | csnf | 6 | 6 | OCR |
| Municipal context | (empty) | 112 | 56 | Deliberations, council reports |
| **Total** | | 511 | 171 | |

Known gaps:

  • CSNF has only 6 chunks (1%) — responses for Bosser's list will be thin
  • 67 documents have empty category field (all OCR programme chunks)
  • Electoral lists overview added 2026-03-10 (reference document from ext_data/LISTS.md)

Ingestion

  • JSONL source: data/audierne2026/rag/documents.jsonl (171 docs)
  • Rebuild: python scripts/rebuild_programs_jsonl.py --apply (reads from /dev/audierne/docs/programmes/)
  • Ingest: python -m app.rag.ingest --reset (chunks at 1500 chars, 200 overlap)
  • Embedding: all-MiniLM-L6-v2 (384-dim, English-centric — adequate for French but not optimal)
  • Store: ChromaDB persistent at data/chromadb/, collection ocapistaine_docs
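The chunking parameters above (1500 characters, 200 overlap) can be illustrated with a minimal sliding-window sketch. This is not the actual splitter in `app.rag.ingest`, which may respect sentence or paragraph boundaries:

```python
def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows with overlap.

    Illustrative sketch of the stated parameters only; the real chunker
    in app.rag.ingest is not reproduced here.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap  # advance by 1300 chars per window
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```

With a 200-character overlap, the tail of each chunk is repeated at the head of the next, so sentences cut at a window boundary remain retrievable from at least one chunk.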

Query Refinement (Layer 1)

Pre-processes user input via OpenAI gpt-4o-mini before retrieval.

Four tasks in one LLM call:

  1. Spelling/grammar correction (accent restoration, proper-case names)
  2. Query reformulation (vague → precise for retrieval)
  3. Category detection (maps to 7 thematic categories)
  4. List detection (maps candidate names to list codes: ca, paa, spae, csnf)

Name gazetteer: Auto-loaded from ext_data/audierne2026/programmes/ colistier files. When a candidate is mentioned, the query is enriched with the full list name for better retrieval context.

Example: "Que propose Bosser ?" → "Que propose Eric Bosser (Cap sur Notre Futur) ?" + list_code: csnf

Code: app/agents/ocapistaine/features/refine.py

Retrieval Metrics (Layer 2)

Every retrieval logs structured metrics to Opik:

| Metric | Meaning | Good | Weak |
|---|---|---|---|
| best_distance | Closest chunk to query | < 0.3 | > 0.5 |
| mean_distance | Average across all chunks | < 0.4 | > 0.5 |
| distance_spread | max - min distance | < 0.2 | > 0.3 |
| distance_gap_1_2 | Gap between #1 and #2 | > 0.05 | < 0.01 |
| unique_lists | Distinct electoral lists | 4+ (overview) | 1 |
| above_threshold_count | Chunks above relevance threshold | > 5 | 0-2 |

Confidence formula: 1 - best_distance (simple, effective)
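The formula and the thresholds in the table can be captured in a small helper. `assess_retrieval` is an illustrative triage function for reading logged metrics, not part of the pipeline:

```python
def confidence(best_distance: float) -> float:
    """The documented formula: 1 - best_distance."""
    return 1.0 - best_distance

def assess_retrieval(metrics: dict) -> list[str]:
    """Flag weak signals using the "Weak" thresholds from the table above.

    Illustrative helper only; the pipeline logs these metrics to Opik
    rather than evaluating them in-process.
    """
    flags = []
    if metrics.get("best_distance", 0.0) > 0.5:
        flags.append("best_distance")
    if metrics.get("unique_lists", 99) == 1:
        flags.append("unique_lists")
    if metrics.get("above_threshold_count", 99) <= 2:
        flags.append("above_threshold_count")
    return flags
```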

Evaluation — Baseline Report (2026-03-10)

Trace Inventory

Source: 142 traces from Opik project ocapistaine-test (production usage since launch).

| Trace Type | Count | Has Metrics |
|---|---|---|
| rag_chat | ~90 | Yes (confidence, retrieval) |
| rag_overview | ~20 | Yes |
| rag_compare | ~20 | Partial (no confidence for some) |
| mockup_query_refine | ~30 | No (test noise, filtered out) |

Confidence Distribution (111 traces with metrics)

| Range | Count | Assessment |
|---|---|---|
| >= 0.70 | 2 | Strong retrieval |
| 0.60 – 0.70 | 19 | Adequate |
| 0.50 – 0.60 | 59 | Marginal |
| < 0.50 | 31 | Weak |

  • Mean: 0.544
  • Median: 0.544
  • Min: 0.355 ("Jeunesse" single-word compare)
  • Max: 0.728 ("que propose les listes pour pierre le lec ?")

33% of queries were refined by the QueryRefiner (37/111).

Conversation Threads

84 threads total, 13 multi-turn (2-5 turns each).

Observed degradation pattern: In multi-turn threads, confidence tends to drop on follow-up questions. The refine step resolves pronouns ("Et lardic ?", "Est il plutot a droite ?") via conversation history, but retrieval operates on the refined query alone without benefiting from accumulated context.
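The pronoun-resolution step described above can be sketched as prompt assembly with prior turns included. This is a minimal illustration; the real prompt in `app/agents/ocapistaine/features/refine.py` is not reproduced here:

```python
def build_refine_prompt(question: str, history: list[tuple[str, str]]) -> str:
    """Assemble a refine prompt that includes prior turns so follow-ups
    like "Et lardic ?" or "Est il plutot a droite ?" can be rewritten as
    standalone queries. Illustrative sketch only.
    """
    lines = ["Conversation so far:"]
    for user, assistant in history:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"Rewrite this follow-up as a standalone query: {question}")
    return "\n".join(lines)
```

Note that only the refined query reaches retrieval, which is exactly the degradation path described: context is used once, at refine time, then discarded.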

Top Improvement Candidates

| Thread | Worst Conf | Issue |
|---|---|---|
| 9f526748 | 0.38 | "Qui es tu?" — no self-identity chunk |
| d7b37b26 | 0.47 | "Qui est Marc Arzel?" — person not in corpus |
| 49714150 | 0.45 | Jeunesse queries — low list diversity |
| 8d4b3d3e | 0.36 | Single-word "Jeunesse" in compare mode |
| 51589cbb | 0.47 | "ou en est la campagne ?" — out-of-scope |

Failure Pattern Analysis

| Pattern | Frequency | Root Cause | Fix |
|---|---|---|---|
| best_distance > 0.5 | 31 traces | Corpus gap or vague query | Add reference docs, improve refine |
| unique_lists = 1 | ~10 traces | List filter too narrow | Check list detection logic |
| Identity questions | 3 traces | No "about" document | Add self-description chunk |
| Campaign status | 2 traces | Out-of-scope for programme RAG | Detect and respond gracefully |
| Person lookup | 3 traces | Names scattered across chunks | Added overview doc with all heads |

Fix Applied: Electoral Lists Overview (2026-03-10)

Problem: "Quelles sont les listes ?" returned best_distance=0.502 (weak). No single chunk contained a structured overview of the four lists with their heads.

Fix: Added ext_data/LISTS.md as a reference document in the JSONL. Contains all four lists, heads, political nuance, slug mappings.

Result: The overview document now ranks #2 at distance=0.406 for list-related queries, containing all four heads of list (Lardic, Guillon, Van Praet, Bosser).

Experiment Framework

Datasets (Opik)

Two evaluation datasets were built from production traces on 2026-03-10:

| Dataset | Items | Purpose |
|---|---|---|
| rag-standalone-20260310 | 63 | Turn-1 queries from each thread (independent, replayable) |
| rag-threaded-20260310 | 79 | Follow-up turns with stored conversation history |

Standalone dataset: Each item contains the original question, baseline confidence, retrieval metrics, and the source trace ID. Can be replayed independently to measure improvement after pipeline changes.

Threaded dataset: Each item contains the follow-up question plus the actual prior conversation (user questions + assistant responses from the original session). This solves the compounding-error problem: we test refine + retrieval with real history, not re-generated responses.
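Hypothetical item shapes for the two datasets, with field names and trace IDs assumed from the descriptions above (the actual Opik item schema may differ):

```python
# Illustrative only: field names and the trace ID are assumptions,
# not the real dataset schema.
standalone_item = {
    "question": "Quelles sont les listes ?",
    "baseline_confidence": 0.498,            # 1 - best_distance of 0.502
    "baseline_metrics": {"best_distance": 0.502, "unique_lists": 4},
    "source_trace_id": "trace-abc123",       # hypothetical ID
}

threaded_item = {
    "question": "Et lardic ?",
    "history": [                             # real prior turns, stored verbatim
        {"role": "user", "content": "Que propose Bosser ?"},
        {"role": "assistant", "content": "Cap sur Notre Futur propose..."},
    ],
    "baseline_confidence": 0.47,
    "source_trace_id": "trace-def456",       # hypothetical ID
}
```

Storing the assistant's actual prior responses (rather than regenerating them at evaluation time) is what avoids the compounding-error problem.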

Experiment Design

┌─────────────────────────────────────────────────┐
│ Standalone Experiment                           │
│                                                 │
│ For each item:                                  │
│  1. Run question through full RAG pipeline      │
│  2. Record new confidence, retrieval metrics    │
│  3. Compare with baseline from original trace   │
│  4. Score: confidence_improvement (delta)       │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│ Threaded Experiment                             │
│                                                 │
│ For each follow-up item:                        │
│  1. Feed stored history to QueryRefiner         │
│  2. Run refined query through retrieval         │
│  3. Compare with baseline                       │
│  4. Score: context_resolution (did refine       │
│     correctly resolve pronouns/references?)     │
└─────────────────────────────────────────────────┘

Custom Metrics

| Metric | Formula | Measures |
|---|---|---|
| confidence_improvement | new_confidence - baseline_confidence | Did the pipeline change help? |
| context_resolution | Refine output matches expected expansion | Does history-aware refine work? |
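Both metrics can be sketched as plain functions. `context_resolution` here uses a substring check as a simplified stand-in; the real scorer might use an LLM judge rather than string matching:

```python
def confidence_improvement(new_confidence: float, baseline_confidence: float) -> float:
    """Delta metric from the table: positive means the change helped."""
    return new_confidence - baseline_confidence

def context_resolution(refined_query: str, expected_terms: list[str]) -> float:
    """Fraction of expected expansions present in the refined query.

    Simplified substring-based stand-in for the Opik metric; illustrative
    only, not the project's actual scorer.
    """
    if not expected_terms:
        return 1.0
    hits = sum(term.lower() in refined_query.lower() for term in expected_terms)
    return hits / len(expected_terms)
```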

First Experiment: rag-after-overview-fix

  • Date: 2026-03-10
  • Change: Added electoral lists overview document to corpus
  • Dataset: rag-standalone-20260310 (63 queries)
  • Provider: Mistral medium-latest
  • Status: Task function completed (all 63 queries re-processed). Metric scoring had a serialization bug (output passed as string, not dict) — to fix in next iteration.

Running Experiments

# Via existing task framework
python -c "
from app.services.tasks import task_opik_evaluate
task_opik_evaluate(experiment_type='rag_chat_evaluation')
"

# Or via workflow_experiment directly
from app.processors.workflows import run_opik_experiment, OpikExperimentConfig

config = OpikExperimentConfig(
    experiment_name='rag-my-experiment',
    dataset_name='rag-standalone-20260310-230639',
    experiment_type='rag_chat_evaluation',
    metrics=['answer_relevance'],
    task_provider='mistral',
)
result = run_opik_experiment(config)

Diagnostic Workflow

When a RAG response is wrong or incomplete, follow the Kvasir diagnostic:

  1. Get the trace: client.get_trace_content(trace_id) from project ocapistaine-test
  2. Check refine span: Was the query improved? Category/list detected?
  3. Check retrieval span: best_distance, unique_lists, chunks_found
  4. Reproduce locally: from app.rag.retrieval import search; search(query, n_results=10)
  5. Check corpus: Search ChromaDB for expected keywords
  6. Diagnose: Use the four-layer model (Refine → Retrieval → Metrics → Ingestion)
  7. Fix cheapest first: query refinement (free) → metadata fix (free) → threshold (free) → add documents (cheap) → re-chunking (medium)

See references/opik-trace-diagnosis.md in the Kvasir skill for the full procedure.
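The cheapest-first ordering in step 7 can be sketched as a triage helper that maps trace signals to the layer worth inspecting first. This is illustrative, not project code:

```python
def diagnose(refined: bool, best_distance: float, unique_lists: int) -> str:
    """Map trace signals to the four-layer model, cheapest fix first.

    Illustrative triage sketch only; the full procedure lives in the
    Kvasir skill's opik-trace-diagnosis reference.
    """
    if not refined:
        return "refine: query was not improved, check the Layer 1 prompt"
    if best_distance > 0.5:
        return "corpus: likely gap, search ChromaDB for expected keywords"
    if unique_lists == 1:
        return "retrieval: list filter too narrow, check list detection"
    return "synthesis: retrieval looks healthy, inspect the Layer 3 prompt"
```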

Roadmap

| Priority | Item | Cost | Impact |
|---|---|---|---|
| P0 | Fix metric serialization bug | Free | Unblocks experiment scoring |
| P0 | Run threaded experiment | Low | Validates context resolution |
| P1 | Add self-identity document ("Qui es tu?") | Free | Fixes 3 failing traces |
| P1 | Fill empty category fields for OCR chunks | Medium | Improves category filtering |
| P1 | Add more CSNF documents (only 6 chunks) | Medium | Balances list representation |
| P2 | French-optimized embeddings (camembert/multilingual-e5) | Higher | Better distance scores for French |
| P2 | Hybrid search (vector + keyword BM25) | Higher | Catches exact-match queries |
| P2 | Re-ranking with cross-encoder | Higher | Better top-k precision |