
Project Sync: Opik Progress & OCR Blockers

· 3 min read
Jean-Noël Schilling
Locki one / French maintainer

Quick catch-up between Johnny (@jnxmas) and Victor (@zcbtvag) covering Opik integration progress and OCR pipeline challenges.

Summary

Johnny showcased recent progress on Opik prompt optimization. A new architecture is in place where prompts are no longer hardcoded but managed via an Opik Prompt Library. The "Charter Validation" prompt has already been optimized using this system.
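The move from hardcoded prompts to a managed library can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `PromptLibrary`, `render`, the prompt name `charter-validation`, and the `{{contribution}}` placeholder are hypothetical and are not the Opik SDK or the project's actual code. The point is only the pattern: application code looks a prompt up by name, so an optimized version can be published without touching the calling code.

```python
from typing import Dict, List, Optional


class PromptLibrary:
    """Hypothetical in-memory stand-in for a managed prompt store (not the Opik API)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def save(self, name: str, template: str) -> int:
        """Store a template under a name and return its 1-based version number.

        Saving text identical to the latest version does not create a new version.
        """
        versions = self._versions.setdefault(name, [])
        if not versions or versions[-1] != template:
            versions.append(template)
        return len(versions)

    def get(self, name: str, version: Optional[int] = None) -> str:
        """Return a specific version, or the latest one when no version is given."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]


def render(template: str, **variables: str) -> str:
    """Fill {{placeholder}} slots in a template with the given values."""
    for key, value in variables.items():
        template = template.replace("{{" + key + "}}", value)
    return template


library = PromptLibrary()
library.save("charter-validation", "Check this contribution against the charter:\n{{contribution}}")
# An optimization run would publish an improved template under the same name.
library.save("charter-validation", "List any charter violations in:\n{{contribution}}")

# Calling code only knows the prompt name, never the prompt text.
prompt = render(library.get("charter-validation"), contribution="More bike lanes please")
```

Keeping older versions addressable makes it possible to compare an optimized prompt against its predecessor on the same test data.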

A new mock-up feature can automatically generate contributions (including ones with violations) from existing meeting reports. The goal is to build a robust dataset for testing and improving the validation agent. However, the auto-generation currently produces repetitive content, so duplicate contributions will need to be identified and aggregated.

Victor reported working on the PDF processing pipeline, specifically on image-based PDFs that require OCR. This has been a blocker because several OCR libraries have proven difficult to install. Extraction from text-based PDFs is working fine.

Johnny clarified that the highest-priority documents for OCR are the "Gwaien" municipal magazines. They contain rich, precise historical data on the municipality's actions over the past six years, and their OCR should be relatively straightforward compared to other documents.

Alignment & Risk Assessment

Area | Status | Notes
Alignment | Good | Team aligned on priorities: contribution processing + data pipeline for RAG
Risk: OCR Blocker | High | Installation/implementation issues slowing data ingestion
Risk: Repetitive Data | Medium | Auto-generated contributions could create biased dataset
Mitigation | In Progress | Focus OCR on high-value "Gwaien" docs first; explore antigravity with Gemini Pro

Branch Status

  • Victor: feature/contribution-refinement
  • Johnny: feature/crawling-migration
  • Action: Coordinate merge before next call

Action Plan & Tasks

1. OCR Pipeline (Priority: High)

  • Task: Implement OCR for "Gwaien" PDF documents
    • Owner: @zcbtvag (Victor)
    • Description: Prioritize implementing a working OCR solution for the "Gwaien" magazine PDFs in the ext_data/ directory. Explore different libraries or the antigravity agent.
    • Deadline: Next call
    • Success Criteria: Text successfully extracted from image-based "Gwaien" PDFs.
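Since text-based extraction already works, one way to scope the OCR task is to first sort pages into those that already carry a text layer and those that do not. The sketch below is an assumption, not the project's pipeline: the function names and the 30-character threshold are hypothetical, and `pypdf` is just one library that could supply the per-page text. The OCR step itself is left out because the choice of library is still open.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page (empty string when there is none).

    Assumes the third-party `pypdf` package is installed. It is imported
    lazily so the triage functions below work without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def pages_needing_ocr(page_texts: List[str], min_chars: int = 30) -> List[int]:
    """Return 0-based indexes of pages whose text layer is missing or too short.

    `min_chars` is an arbitrary threshold chosen for illustration.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def classify_pdf(page_texts: List[str], min_chars: int = 30) -> str:
    """Label a document 'text', 'image', or 'mixed' based on its pages."""
    scanned = pages_needing_ocr(page_texts, min_chars)
    if not scanned:
        return "text"
    if len(scanned) == len(page_texts):
        return "image"
    return "mixed"
```

Running `classify_pdf(extract_page_texts(path))` over the files in ext_data/ would give a list of the documents that actually need OCR, which helps confirm which "Gwaien" issues are image-based before investing in a specific OCR library.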

2. Code Sharing & Tooling

  • Task: Push antigravity agent code
    • Owner: @jnxmas (Johnny)
    • Description: Push changes containing antigravity agent experiments so Victor can explore it as a potential tool for OCR implementation.
    • Deadline: ASAP
    • Success Criteria: Victor can access and run the new agent code locally.

3. Branch Integration

  • Task: Merge feature branches
    • Owner: @zcbtvag & @jnxmas
    • Description: Coordinate to merge feature/contribution-refinement and feature/crawling-migration branches.
    • Deadline: Next call
    • Success Criteria: Single unified branch without conflicts.

4. Data Quality

  • Task: Develop duplicate detection strategy
    • Owner: @jnxmas (Johnny)
    • Description: Design a method to identify when a new contribution is a duplicate or highly similar to an existing one. Critical for managing auto-generation output and real citizen contributions.
    • Deadline: Next call
    • Success Criteria: Clear plan or PoC documented for identifying duplicate contributions.
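As a possible starting point for the PoC, near-duplicate contributions can be flagged by comparing their word sets. This is a minimal sketch under stated assumptions: the function names, the 0.8 threshold, and the choice of Jaccard similarity are illustrative and were not decided in the call.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def find_duplicates(
    contributions: List[str], threshold: float = 0.8
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair scoring at or above the threshold.

    The threshold is an arbitrary illustrative value that would need tuning.
    """
    token_sets = [tokenize(text) for text in contributions]
    pairs = []
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            score = jaccard(token_sets[i], token_sets[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


contributions = [
    "Please add more bike lanes near the harbour.",
    "please add more bike lanes near the harbour",
    "The library should open on Sundays.",
]
duplicates = find_duplicates(contributions)  # flags the first two entries
```

The pairwise comparison is quadratic in the number of contributions, which is acceptable for a small test dataset. Word overlap only catches rewordings that reuse the same vocabulary, so paraphrased duplicates would need a different similarity measure.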

Open Tasks from Previous Sessions

  • Finalize Firecrawl pipeline implementation
  • Secure Firecrawl API keys
  • Deploy initial Opik tracing for all major workflows

Status Dashboard

Component | Status | Notes
Data Ingestion (Firecrawl/OCR) | Blocked | OCR implementation issues
Contribution Automation (Email-GitHub) | Done | Core workflow established
Opik Integration & Prompt Tuning | Done | New prompt library + optimization experiments
RAG Chatbot | Blocked | Waiting on data ingestion

Next Milestone: Unblock OCR pipeline to begin ingesting all PDF data for the RAG system.