# CamemBERT Embedding — Future Exploration
**Status:** Parked (2026-03-12). Current priority is chunking quality, not an embedding swap.
## Context
The current embedding model, `all-MiniLM-L6-v2` (384 dims, 80 MB), is English-first. French semantic queries suffer: "inondation" (flood) ranks #22 of 66 for SPAE flood content instead of in the top 5.
## Candidates Evaluated
| Model | Dims | Size | French quality | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (current) | 384 | 80 MB | Poor | English-only training |
| multilingual-e5-small | 384 | 450 MB | Good | Same dims, needs query:/passage: prefixes |
| multilingual-e5-base | 768 | 1.1 GB | Very good | Best quality/size tradeoff |
| camembert-base-mmarcoFR | 768 | 440 MB | Very good (retrieval-tuned) | French passage retrieval on mMARCO-FR |
| sentence-camembert-large | 1024 | 1.3 GB | Excellent (STS) | French-native, top scorer on French STS |
| BGE-M3 | 1024 | 2.2 GB | Very good | 8k context, hybrid dense+sparse |
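The `query:`/`passage:` prefix requirement noted for the e5 models means every string must be wrapped before encoding. A minimal sketch of that wrapping, assuming sentence-transformers-style usage downstream (the helper names are hypothetical, not a library API):

```python
# Sketch of the query:/passage: prefix convention expected by the
# multilingual-e5 family. Helper names are hypothetical.

def e5_query(text: str) -> str:
    """Prefix a search query before handing it to an e5 encoder."""
    return f"query: {text}"

def e5_passage(text: str) -> str:
    """Prefix a document chunk before handing it to an e5 encoder."""
    return f"passage: {text}"

# The strings actually encoded would look like:
print(e5_query("inondation"))   # query: inondation
print(e5_passage("Risques d'inondation dans la zone SPAE"))
```

Forgetting the prefixes silently degrades e5 retrieval quality, so any swap to an e5 model would need this applied at both ingest and query time.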
## Quick Test Results (2026-03-12)
`antoinelouis/biencoder-camembert-base-mmarcoFR` was tested on the "inondation" query against the SPAE chunks:
- It did not improve the ranking vs MiniLM on this specific query
- The inondation chunk did not appear in the top 10
- Possible cause: the chunk's content starts with an agriculture/fishing preamble, burying the flood terms
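The ranking check behind these results reduces to cosine similarity between the query vector and each chunk vector. A minimal sketch with toy 3-dim vectors standing in for real embeddings (all names and numbers are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_chunks(query_vec, chunk_vecs):
    """Return (chunk_id, score) pairs sorted by descending similarity."""
    scored = [(cid, cosine(query_vec, v)) for cid, v in chunk_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy "embeddings" (real ones are 384 or 768 dims)
query = [1.0, 0.0, 0.0]
chunks = {
    "inondation-chunk": [0.9, 0.1, 0.0],
    "agriculture-chunk": [0.2, 0.9, 0.1],
}
ranking = rank_chunks(query, chunks)
print(ranking[0][0])  # inondation-chunk
```

A chunk whose vector is dominated by preamble content (the agriculture/fishing case above) drifts away from the query direction, which is consistent with the observed miss.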
## Conclusion
The bottleneck was chunking, not embeddings. Header-aware splitting (on `##`/`###`) improved the "inondation" ranking from #22 to #8. An embedding swap remains a valid improvement but is lower priority.
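Header-aware splitting can be approximated with a regex split before each `##`/`###` heading line, so every heading stays attached to the text that follows it. A minimal sketch, not the actual ingest code:

```python
import re

def split_on_headers(markdown: str) -> list[str]:
    """Split a markdown document into chunks at ## / ### headings,
    keeping each heading with the body text below it."""
    # Zero-width lookahead: split before lines starting with ## or ###
    parts = re.split(r"(?m)^(?=#{2,3}\s)", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """## Agriculture
Pêche et agriculture dans la zone.

### Inondation
Risque d'inondation élevé près du fleuve.
"""
chunks = split_on_headers(doc)
print(len(chunks))                 # 2
print(chunks[1].splitlines()[0])   # ### Inondation
```

Splitting this way keeps the flood terms at the top of their own chunk instead of buried under an unrelated preamble, which is why the ranking improved.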
## When to Revisit
- If French semantic gaps persist after chunking improvements
- If Docker image size constraints allow a ~450 MB model
- If retrieval quality audits (Forseti) flag systematic French mismatches
## References
- MTEB-French benchmark — 46 models evaluated
- ChromaDB embedding functions
- Any swap requires a full re-ingest: `python -m app.rag.ingest --reset`