RAG Pipeline — Status & Evaluation
Last updated: 2026-03-10
Architecture
```
User Question
      │
      ▼
┌──────────────┐   gpt-4o-mini    ┌──────────────────┐
│ QueryRefiner │ ───────────────► │ RefineResult     │
│  (Layer 1)   │                  │ query, category, │
│              │                  │ list_code,       │
└──────────────┘                  │ corrections      │
      │                           └──────────────────┘
      ▼
┌──────────────┐   all-MiniLM     ┌────────────────────┐
│   ChromaDB   │ ───────────────► │ RetrievalResult[]  │
│  (Layer 2)   │      L6-v2       │ content, metadata, │
│              │                  │ distance           │
└──────────────┘                  └────────────────────┘
      │
      ▼
┌──────────────┐    Mistral       ┌──────────────────┐
│  Synthesis   │    medium        │ ChatResult /     │
│  (Layer 3)   │ ───────────────► │ CompareResult    │
│              │                  │ + retrieval      │
└──────────────┘                  │ metrics          │
                                  └──────────────────┘
```
Provider: Mistral medium-latest (default), failover chain: mistral → ollama → openai → claude → gemini
Tracing: All traces logged to Opik project ocapistaine-test
Corpus
| Source | Slug | Chunks | Documents | Type |
|---|---|---|---|---|
| Programme co-construit | audierne2026 | 280 | 51 | README, contributions, PDF extracts |
| Passons à l'Action ! | paa | 55 | 24 | OCR (Mistral Document AI) |
| Construire l'Avenir | ca | 31 | 18 | OCR |
| S'unir pour Audierne-Esquibien | spae | 27 | 16 | OCR |
| Cap sur Notre Futur | csnf | 6 | 6 | OCR |
| Municipal context | (empty) | 112 | 56 | Deliberations, council reports |
| Total | | 511 | 171 | |
Known gaps:
- CSNF has only 6 chunks (1%) — responses for Bosser's list will be thin
- 67 documents have an empty `category` field (all OCR programme chunks)
- Electoral lists overview added 2026-03-10 (reference document from `ext_data/LISTS.md`)
Ingestion
- JSONL source: `data/audierne2026/rag/documents.jsonl` (171 docs)
- Rebuild: `python scripts/rebuild_programs_jsonl.py --apply` (reads from `/dev/audierne/docs/programmes/`)
- Ingest: `python -m app.rag.ingest --reset` (chunks at 1500 chars, 200 overlap)
- Embedding: `all-MiniLM-L6-v2` (384-dim, English-centric; adequate for French but not optimal)
- Store: ChromaDB persistent at `data/chromadb/`, collection `ocapistaine_docs`
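As a rough illustration of these chunking parameters, a sliding-window splitter with a 1500-char window and 200-char overlap looks like the sketch below; the actual `app.rag.ingest` implementation is not reproduced here, so treat this as illustrative only.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with a sliding-window overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # advance 1300 chars per chunk by default
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks

parts = chunk_text("x" * 3000)
print([len(p) for p in parts])  # [1500, 1500, 400]
```

Each chunk repeats the last 200 chars of its predecessor, so a sentence cut at a boundary still appears whole in one of the two chunks.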
Query Refinement (Layer 1)
Pre-processes user input via OpenAI gpt-4o-mini before retrieval.
Four tasks in one LLM call:
- Spelling/grammar correction (accent restoration, proper-case names)
- Query reformulation (vague → precise for retrieval)
- Category detection (maps to 7 thematic categories)
- List detection (maps candidate names to list codes: ca, paa, spae, csnf)
Name gazetteer: Auto-loaded from ext_data/audierne2026/programmes/ colistier files. When a candidate is mentioned, the query is enriched with the full list name for better retrieval context.
Example: "Que propose Bosser ?" → "Que propose Eric Bosser (Cap sur Notre Futur) ?" + list_code: csnf
Code: app/agents/ocapistaine/features/refine.py
Retrieval Metrics (Layer 2)
Every retrieval logs structured metrics to Opik:
| Metric | Meaning | Good | Weak |
|---|---|---|---|
| `best_distance` | Closest chunk to query | < 0.3 | > 0.5 |
| `mean_distance` | Average across all chunks | < 0.4 | > 0.5 |
| `distance_spread` | max - min distance | < 0.2 | > 0.3 |
| `distance_gap_1_2` | Gap between #1 and #2 | > 0.05 | < 0.01 |
| `unique_lists` | Distinct electoral lists | 4+ (overview) | 1 |
| `above_threshold_count` | Chunks above relevance threshold | > 5 | 0-2 |
Confidence formula: `confidence = 1 - best_distance` (simple, effective)
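Given the list of chunk distances returned by ChromaDB (lower = closer), the metrics above reduce to a few lines. A minimal sketch; the 0.5 threshold is an assumption for illustration, and `unique_lists` is omitted because it needs chunk metadata, not just distances:

```python
def retrieval_metrics(distances: list[float], threshold: float = 0.5) -> dict:
    """Derive the Layer-2 metrics from chunk distances (lower = closer)."""
    d = sorted(distances)
    return {
        "best_distance": d[0],
        "mean_distance": sum(d) / len(d),
        "distance_spread": d[-1] - d[0],
        "distance_gap_1_2": d[1] - d[0] if len(d) > 1 else 0.0,
        # "above threshold" = more relevant than the cutoff, i.e. smaller distance
        "above_threshold_count": sum(1 for x in d if x < threshold),
        "confidence": 1 - d[0],  # the simple confidence formula
    }

m = retrieval_metrics([0.30, 0.35, 0.42, 0.55])
print(round(m["confidence"], 3), m["above_threshold_count"])  # 0.7 3
```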
Evaluation — Baseline Report (2026-03-10)
Trace Inventory
Source: 142 traces from Opik project ocapistaine-test (production usage since launch).
| Trace Type | Count | Has Metrics |
|---|---|---|
| `rag_chat` | ~90 | Yes (confidence, retrieval) |
| `rag_overview` | ~20 | Yes |
| `rag_compare` | ~20 | Partial (no confidence for some) |
| `mockup_query_refine` | ~30 | No (test noise, filtered out) |
Confidence Distribution (111 traces with metrics)
| Range | Count | Assessment |
|---|---|---|
| >= 0.70 | 2 | Strong retrieval |
| 0.60 – 0.70 | 19 | Adequate |
| 0.50 – 0.60 | 59 | Marginal |
| < 0.50 | 31 | Weak |
- Mean: 0.544
- Median: 0.544
- Min: 0.355 ("Jeunesse", a single-word compare)
- Max: 0.728 ("que propose les listes pour pierre le lec ?")
33% of queries were refined by the QueryRefiner (37/111).
Conversation Threads
84 threads total, 13 multi-turn (2-5 turns each).
Observed degradation pattern: In multi-turn threads, confidence tends to drop on follow-up questions. The refine step resolves pronouns ("Et lardic ?", "Est il plutot a droite ?") via conversation history, but retrieval operates on the refined query alone without benefiting from accumulated context.
Top Improvement Candidates
| Thread | Worst Conf | Issue |
|---|---|---|
| 9f526748 | 0.38 | "Qui es tu?" — no self-identity chunk |
| d7b37b26 | 0.47 | "Qui est Marc Arzel?" — person not in corpus |
| 49714150 | 0.45 | Jeunesse queries — low list diversity |
| 8d4b3d3e | 0.36 | Single-word "Jeunesse" in compare mode |
| 51589cbb | 0.47 | "ou en est la campagne ?" — out of scope |
Failure Pattern Analysis
| Pattern | Frequency | Root Cause | Fix |
|---|---|---|---|
| `best_distance > 0.5` | 31 traces | Corpus gap or vague query | Add reference docs, improve refine |
| `unique_lists = 1` | ~10 traces | List filter too narrow | Check list detection logic |
| Identity questions | 3 traces | No "about" document | Add self-description chunk |
| Campaign status | 2 traces | Out-of-scope for programme RAG | Detect and respond gracefully |
| Person lookup | 3 traces | Names scattered across chunks | Added overview doc with all heads |
Fix Applied: Electoral Lists Overview (2026-03-10)
Problem: "Quelles sont les listes ?" returned best_distance=0.502 (weak). No single chunk contained a structured overview of the four lists with their heads.
Fix: Added ext_data/LISTS.md as a reference document in the JSONL. Contains all four lists, heads, political nuance, slug mappings.
Result: The overview document now ranks #2 at distance=0.406 for list-related queries, containing all four heads of list (Lardic, Guillon, Van Praet, Bosser).
Experiment Framework
Datasets (Opik)
Two evaluation datasets were built from production traces on 2026-03-10:
| Dataset | Items | Purpose |
|---|---|---|
| `rag-standalone-20260310` | 63 | Turn-1 queries from each thread (independent, replayable) |
| `rag-threaded-20260310` | 79 | Follow-up turns with stored conversation history |
Standalone dataset: Each item contains the original question, baseline confidence, retrieval metrics, and the source trace ID. Can be replayed independently to measure improvement after pipeline changes.
Threaded dataset: Each item contains the follow-up question plus the actual prior conversation (user questions + assistant responses from the original session). This solves the compounding-error problem: we test refine + retrieval with real history, not re-generated responses.
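Illustrative item shapes for the two datasets; the field names and values below are assumptions for the sketch, not the exact Opik schema:

```python
# A standalone item: an independently replayable turn-1 query plus its baseline.
standalone_item = {
    "question": "Quelles sont les listes ?",
    "baseline_confidence": 0.498,                 # illustrative value
    "baseline_metrics": {"best_distance": 0.502, "unique_lists": 4},
    "source_trace_id": "trace-123",               # hypothetical ID
}

# A threaded item also carries the real prior conversation, so the refine step
# is evaluated against genuine history instead of re-generated responses.
threaded_item = {
    "question": "Et lardic ?",
    "history": [
        {"role": "user", "content": "Que propose Bosser ?"},
        {"role": "assistant", "content": "(stored original answer)"},
    ],
    "baseline_confidence": 0.47,                  # illustrative value
}
```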
Experiment Design
```
┌─────────────────────────────────────────────────┐
│ Standalone Experiment                           │
│                                                 │
│ For each item:                                  │
│   1. Run question through full RAG pipeline     │
│   2. Record new confidence, retrieval metrics   │
│   3. Compare with baseline from original trace  │
│   4. Score: confidence_improvement (delta)      │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│ Threaded Experiment                             │
│                                                 │
│ For each follow-up item:                        │
│   1. Feed stored history to QueryRefiner        │
│   2. Run refined query through retrieval        │
│   3. Compare with baseline                      │
│   4. Score: context_resolution (did refine      │
│      correctly resolve pronouns/references?)    │
└─────────────────────────────────────────────────┘
```
Custom Metrics
| Metric | Formula | Measures |
|---|---|---|
| `confidence_improvement` | `new_confidence - baseline_confidence` | Did the pipeline change help? |
| `context_resolution` | Refine output matches expected expansion | Does history-aware refine work? |
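Both metrics can be sketched as plain scoring functions, independent of Opik's metric interface; the substring-based expansion check below is an assumption about how "matches expected expansion" is judged:

```python
def confidence_improvement(baseline: float, new: float) -> float:
    """Positive delta means the pipeline change helped this query."""
    return new - baseline

def context_resolution(refined_query: str, expected_terms: list[str]) -> float:
    """Fraction of expected expansions (resolved names/references) that
    actually appear in the refined query, e.g. ['Lardic'] for 'Et lardic ?'."""
    if not expected_terms:
        return 1.0
    hits = sum(1 for t in expected_terms if t.lower() in refined_query.lower())
    return hits / len(expected_terms)

delta = confidence_improvement(0.498, 0.594)  # > 0 means improvement
score = context_resolution("Que propose la liste de Lardic ?", ["Lardic"])  # 1.0
```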
First Experiment: rag-after-overview-fix
- Date: 2026-03-10
- Change: Added electoral lists overview document to corpus
- Dataset: `rag-standalone-20260310` (63 queries)
- Provider: Mistral medium-latest
- Status: Task function completed (all 63 queries re-processed). Metric scoring had a serialization bug (output passed as string, not dict) — to fix in next iteration.
Running Experiments
```bash
# Via existing task framework
python -c "
from app.services.tasks import task_opik_evaluate
task_opik_evaluate(experiment_type='rag_chat_evaluation')
"
```

```python
# Or via workflow_experiment directly
from app.processors.workflows import run_opik_experiment, OpikExperimentConfig

config = OpikExperimentConfig(
    experiment_name='rag-my-experiment',
    dataset_name='rag-standalone-20260310-230639',
    experiment_type='rag_chat_evaluation',
    metrics=['answer_relevance'],
    task_provider='mistral',
)
result = run_opik_experiment(config)
```
Diagnostic Workflow
When a RAG response is wrong or incomplete, follow the Kvasir diagnostic:
- Get the trace: `client.get_trace_content(trace_id)` from project `ocapistaine-test`
- Check refine span: Was the query improved? Category/list detected?
- Check retrieval span: `best_distance`, `unique_lists`, `chunks_found`
- Reproduce locally: `from app.rag.retrieval import search; search(query, n_results=10)`
- Check corpus: Search ChromaDB for expected keywords
- Diagnose: Use the four-layer model (Refine → Retrieval → Metrics → Ingestion)
- Fix cheapest first: query refinement (free) → metadata fix (free) → threshold (free) → add documents (cheap) → re-chunking (medium)
See references/opik-trace-diagnosis.md in the Kvasir skill for the full procedure.
Roadmap
| Priority | Item | Cost | Impact |
|---|---|---|---|
| P0 | Fix metric serialization bug | Free | Unblocks experiment scoring |
| P0 | Run threaded experiment | Low | Validates context resolution |
| P1 | Add self-identity document ("Qui es tu?") | Free | Fixes 3 failing traces |
| P1 | Fill empty category fields for OCR chunks | Medium | Improves category filtering |
| P1 | Add more CSNF documents (only 6 chunks) | Medium | Balances list representation |
| P2 | French-optimized embeddings (camembert/multilingual-e5) | Higher | Better distance scores for French |
| P2 | Hybrid search (vector + keyword BM25) | Higher | Catches exact-match queries |
| P2 | Re-ranking with cross-encoder | Higher | Better top-k precision |