
CamemBERT Embedding — Future Exploration

Status: Parked (2026-03-12). Current priority is chunking quality, not embedding swap.

Context

The current embedding model, all-MiniLM-L6-v2 (384 dims, 80 MB), is English-first. French semantic queries suffer: "inondation" ranks #22 of 66 instead of in the top 5 for SPAE flood content.

Candidates Evaluated

| Model | Dims | Size | French quality | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (current) | 384 | 80 MB | Poor | English-only training |
| multilingual-e5-small | 384 | 450 MB | Good | Same dims; needs `query:`/`passage:` prefixes |
| multilingual-e5-base | 768 | 1.1 GB | Very good | Best quality/size tradeoff |
| camembert-base-mmarcoFR | 768 | 440 MB | Very good (retrieval-tuned) | French passage retrieval on mMARCO-FR |
| sentence-camembert-large | 1024 | 1.3 GB | Excellent (STS) | French-native, top scorer on French STS |
| BGE-M3 | 1024 | 2.2 GB | Very good | 8k context, hybrid dense+sparse |
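A practical note on the E5 family: its model card requires prefixing texts with `query:` or `passage:` before encoding, or retrieval quality degrades. A minimal sketch of that convention (the helper names are illustrative, not project code):

```python
# Illustrative helpers for the E5 prefix convention: queries and passages
# must be prefixed before being passed to the embedding model.

def e5_query(text: str) -> str:
    """Prefix a search query for multilingual-e5 models."""
    return f"query: {text}"

def e5_passage(text: str) -> str:
    """Prefix a document chunk for multilingual-e5 models."""
    return f"passage: {text}"

# These strings are what would then go to model.encode(...)
print(e5_query("inondation"))                              # query: inondation
print(e5_passage("Risques d'inondation dans la zone SPAE"))
```

Because the prefixes change the embeddings, any swap to an E5 model also means re-embedding the whole corpus with the `passage:` prefix.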

Quick Test Results (2026-03-12)

antoinelouis/biencoder-camembert-base-mmarcoFR was tested on the "inondation" query against the SPAE chunks:

  • Did not improve ranking vs MiniLM on this specific query
  • The inondation chunk did not appear in top 10
  • Possible cause: chunk content starts with agriculture/fishing preamble, burying the flood terms
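The test above reduces to encoding the query and every chunk, then ranking chunks by cosine similarity. A minimal sketch with toy vectors standing in for real embeddings (`rank_chunks` is illustrative, not the project's retrieval code):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_chunks(query_vec: list[float], chunk_vecs: list[list[float]]) -> list[int]:
    """Return chunk indices sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy 3-dim "embeddings": chunk 2 points closest to the query direction.
query = [0.9, 0.1, 0.0]
chunks = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0], [1.0, 0.1, 0.1]]
print(rank_chunks(query, chunks))  # [2, 1, 0]
```

With real models the only change is where the vectors come from; the ranking step is identical, which is why a chunk whose flood terms are buried behind a long preamble gets a diluted vector and a low rank.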

Conclusion

The bottleneck was chunking, not embeddings. Header-aware splitting (on ##/###) fixed the inondation ranking from #22 to #8. Embedding swap remains a valid improvement but is lower priority.
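The header-aware splitting described above can be sketched as a regex split at each ##/### heading, so a chunk starts at its most relevant section instead of inheriting a document-level preamble (a simplified illustration, not the production splitter):

```python
import re

def split_on_headers(markdown: str) -> list[str]:
    """Split markdown into chunks at each ## or ### heading.

    Text before the first heading becomes its own chunk, so an
    agriculture/fishing preamble no longer buries a flood section.
    """
    # Zero-width split: cut at line starts that begin a ## or ### heading.
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """Préambule sur l'agriculture et la pêche.

## Inondation
Le risque d'inondation est élevé dans la zone SPAE.

### Mesures
Digues et zones tampons.
"""
chunks = split_on_headers(doc)
print(len(chunks))                 # 3
print(chunks[1].splitlines()[0])   # ## Inondation
```

Each chunk now leads with its own heading text, which is exactly what moved the "inondation" chunk's embedding closer to the query.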

When to Revisit

  • If French semantic gaps persist after chunking improvements
  • If Docker image size constraints allow a ~450 MB model
  • If retrieval quality audits (Forseti) flag systematic French mismatches
