Project Sync: Opik Progress & OCR Blockers

January 31, 2026 · 3 min read

Locki one / French maintainer

Quick catch-up between Johnny (@jnxmas) and Victor (@zcbtvag) covering Opik integration progress and OCR pipeline challenges.

Summary

Johnny showcased recent progress on Opik prompt optimization. A new architecture is in place where prompts are no longer hardcoded but managed via an Opik Prompt Library. The "Charter Validation" prompt has already been optimized using this system.

A new mock-up feature can automatically generate contributions, including ones that contain violations, from existing meeting reports. The goal is to build a robust dataset for testing and improving the validation agent. However, the auto-generation currently produces repetitive content, which will have to be handled by identifying and aggregating duplicate contributions.

Victor reported working on the PDF processing pipeline, specifically on image-based PDFs that require OCR. This has been a blocker because several libraries have proven difficult to install. Extraction from text-based PDFs works fine.

Johnny clarified that the highest-priority documents for OCR are the "Gwaien" municipal magazines: they contain rich, precise historical data on the municipality's actions over the past six years, and their OCR should be relatively straightforward compared to other documents.

Alignment & Risk Assessment

Area	Status	Notes
Alignment	Good	Team aligned on priorities: contribution processing + data pipeline for RAG
Risk: OCR Blocker	High	Installation/implementation issues slowing data ingestion
Risk: Repetitive Data	Medium	Auto-generated contributions could create biased dataset
Mitigation	In Progress	Focus OCR on high-value "Gwaien" docs first; explore the antigravity agent with Gemini Pro

Branch Status

Victor: feature/contribution-refinement
Johnny: feature/crawling-migration
Action: Coordinate merge before next call

Action Plan & Tasks

1. OCR Pipeline (Priority: High)

Task: Implement OCR for "Gwaien" PDF documents
- Owner: @zcbtvag (Victor)
- Description: Prioritize implementing a working OCR solution for the "Gwaien" magazine PDFs in the ext_data/ directory. Explore different libraries or the antigravity agent.
- Deadline: Next call
- Success Criteria: Text successfully extracted from image-based "Gwaien" PDFs.
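
Since text-based extraction already works, one way to keep the OCR work narrow is to send only pages without a usable text layer to OCR. The sketch below is a minimal illustration of that routing step under stated assumptions: `PageText`, `needs_ocr`, `pages_needing_ocr`, and the 40-character threshold are hypothetical names and values, not part of the project. It does not perform OCR itself; it takes whatever text the existing extractor returned for each page.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    """Hypothetical container: one page and the text the extractor returned."""
    page_number: int
    text: str


def needs_ocr(text: str, min_chars: int = 40) -> bool:
    """Treat a page as image-based when its text layer is (nearly) empty.

    Whitespace is ignored so that pages containing only line breaks or a
    stray page number still count as lacking a text layer.
    """
    visible_chars = "".join(text.split())
    return len(visible_chars) < min_chars


def pages_needing_ocr(pages: list[PageText], min_chars: int = 40) -> list[int]:
    """Return the page numbers that should be routed to the OCR step."""
    return [p.page_number for p in pages if needs_ocr(p.text, min_chars)]


if __name__ == "__main__":
    sample = [
        PageText(1, "Bulletin municipal " * 10),  # real text layer
        PageText(2, ""),                           # scanned page, no text
        PageText(3, "  \n 12 "),                   # only a stray page number
    ]
    print(pages_needing_ocr(sample))
```

The threshold is a guess and would need tuning against the actual "Gwaien" PDFs; a magazine page that is mostly photographs may legitimately contain very little text.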

2. Code Sharing & Tooling

Task: Push antigravity agent code
- Owner: @jnxmas (Johnny)
- Description: Push changes containing antigravity agent experiments so Victor can explore it as a potential tool for OCR implementation.
- Deadline: ASAP
- Success Criteria: Victor can access and run the new agent code locally.

3. Branch Integration

Task: Merge feature branches
- Owner: @zcbtvag & @jnxmas
- Description: Coordinate to merge feature/contribution-refinement and feature/crawling-migration branches.
- Deadline: Next call
- Success Criteria: Single unified branch without conflicts.

4. Data Quality

Task: Develop duplicate detection strategy
- Owner: @jnxmas (Johnny)
- Description: Design a method to identify when a new contribution is a duplicate of, or highly similar to, an existing one. This is critical for managing both auto-generated output and real citizen contributions.
- Deadline: Next call
- Success Criteria: Clear plan or PoC documented for identifying duplicate contributions.
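
As a possible starting point for the PoC, the sketch below flags near-duplicate contributions by comparing word 3-gram ("shingle") sets with Jaccard similarity. This is one candidate technique, not the approach the team has chosen. The function names, the sample contributions, and the 0.6 threshold are all illustrative assumptions.

```python
import re
from itertools import combinations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams (shingles) of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short texts are represented by a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(
    contributions: dict[str, str], threshold: float = 0.6
) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    sets = {cid: shingles(text) for cid, text in contributions.items()}
    pairs = []
    for (id_a, set_a), (id_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


if __name__ == "__main__":
    # Hypothetical contributions, invented for illustration only.
    sample = {
        "c1": "The town hall should extend the opening hours of the public library",
        "c2": "The town hall should extend the opening hours of the public library please",
        "c3": "Repair the harbour road before the summer season",
    }
    for id_a, id_b, score in find_duplicates(sample):
        print(f"{id_a} ~ {id_b}: {score:.2f}")
```

The pairwise comparison is quadratic in the number of contributions, which is fine for a small test dataset but would need indexing for larger volumes. Lexical overlap also misses paraphrases, so an embedding-based comparison may be worth evaluating alongside it.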

Open Tasks from Previous Sessions

Finalize Firecrawl pipeline implementation
Secure Firecrawl API keys
Deploy initial Opik tracing for all major workflows

Status Dashboard

Component	Status	Notes
Data Ingestion (Firecrawl/OCR)	Blocked	OCR implementation issues
Contribution Automation (Email-GitHub)	Done	Core workflow established
Opik Integration & Prompt Tuning	Done	New prompt library + optimization experiments
RAG Chatbot	Blocked	Waiting on data ingestion

Next Milestone: Unblock OCR pipeline to begin ingesting all PDF data for the RAG system.
