Firecrawl Workflow

Overview

The Firecrawl workflow collects and processes municipal documents from the Audierne website for the RAG knowledge base.

Architecture

src/
├── config.py # Data source configuration
├── firecrawl_utils.py # Firecrawl manager and utilities
└── crawl_municipal_docs.py # Main orchestration script

ext_data/
├── mairie_arretes/ # Output: arrêtés & publications
├── mairie_deliberations/ # Output: délibérations
└── commission_controle/ # Output: commission documents

Data Sources

| Source | URL | Method | Expected Count |
|---|---|---|---|
| mairie_arretes | audierne.bzh/publications-arretes/ | firecrawl+ocr | ~4010 |
| mairie_deliberations | audierne.bzh/deliberations-conseil-municipal/ | firecrawl+ocr | - |
| commission_controle | audierne.bzh/systeme/documentheque/?documents_category=49 | firecrawl+ocr | - |
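The sources above are declared in `src/config.py`. A minimal sketch of what that configuration could look like, assuming a simple dataclass-per-source layout (the actual field and class names in `config.py` may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSource:
    name: str    # also the output directory name under ext_data/
    url: str     # start URL for the crawl
    method: str  # processing pipeline, e.g. "firecrawl+ocr"

SOURCES = {
    "mairie_arretes": DataSource(
        name="mairie_arretes",
        url="https://audierne.bzh/publications-arretes/",
        method="firecrawl+ocr",
    ),
    "mairie_deliberations": DataSource(
        name="mairie_deliberations",
        url="https://audierne.bzh/deliberations-conseil-municipal/",
        method="firecrawl+ocr",
    ),
    "commission_controle": DataSource(
        name="commission_controle",
        url="https://audierne.bzh/systeme/documentheque/?documents_category=49",
        method="firecrawl+ocr",
    ),
}
```

Keeping each source's name identical to its output directory lets the orchestration script derive paths directly from the registry.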

Usage

Quick Start

# Set API key
export FIRECRAWL_API_KEY="your_key_here"

# Install dependencies
poetry install

# Dry run
poetry run python src/crawl_municipal_docs.py --dry-run

Scrape Single Page (Testing)

poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode scrape

Crawl Full Site (Production)

poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 500

Command-Line Options

| Option | Description | Default |
|---|---|---|
| --source | Which source to process | all |
| --mode | scrape (single page) or crawl (full site) | scrape |
| --max-pages | Maximum pages to crawl | 100 |
| --api-key | Firecrawl API key | env var |
| --dry-run | Preview without crawling | false |
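A hedged sketch of how these options could be wired with `argparse`; the real `crawl_municipal_docs.py` may declare them differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI matching the options table above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Crawl Audierne municipal documents"
    )
    parser.add_argument("--source", default="all",
                        help="Which source to process (default: all)")
    parser.add_argument("--mode", choices=["scrape", "crawl"], default="scrape",
                        help="scrape = single page, crawl = full site")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum pages to crawl")
    parser.add_argument("--api-key", default=None,
                        help="Firecrawl API key (falls back to FIRECRAWL_API_KEY)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview the planned crawl without calling the API")
    return parser

args = build_parser().parse_args(
    ["--source", "mairie_arretes", "--mode", "crawl", "--max-pages", "10"]
)
```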

Output Structure

ext_data/<source_name>/
├── <page1>.md # Markdown content
├── <page1>.html # HTML content
├── <page1>_metadata.json # Full metadata
├── index_<timestamp>.md # Index of all pages
├── crawl_metadata_<timestamp>.json
└── errors.log # Error log (if any)
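One way the per-page files in this layout could be written, assuming each page has a slug plus Markdown, HTML, and metadata from Firecrawl (function and field names here are illustrative, not the script's actual API):

```python
import json
from pathlib import Path

def save_page(out_dir: Path, slug: str, markdown: str,
              html: str, metadata: dict) -> None:
    """Persist one crawled page as <slug>.md, <slug>.html, <slug>_metadata.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{slug}.md").write_text(markdown, encoding="utf-8")
    (out_dir / f"{slug}.html").write_text(html, encoding="utf-8")
    (out_dir / f"{slug}_metadata.json").write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

`ensure_ascii=False` keeps French accented characters readable in the metadata JSON instead of escaping them.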

Phase 1: Exploration

# Test each source structure
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode scrape

Phase 2: Limited Crawl

# Validate with small sample
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 10

Phase 3: Full Crawl

# Production crawl
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 500

Phase 4: OCR Processing

After Firecrawl completes, process downloaded PDFs:

  • Apply OCR (language: French)
  • Extract text from scanned documents
  • Feed into RAG system
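The hand-off to the OCR step can be as simple as collecting every downloaded PDF under `ext_data/`. A minimal sketch (the OCR call itself, e.g. Tesseract with French language data, is a separate dependency and omitted here):

```python
from pathlib import Path

def pdfs_to_ocr(ext_data: Path) -> list[Path]:
    """Return every downloaded PDF under ext_data/, sorted for stable runs."""
    return sorted(ext_data.rglob("*.pdf"))
```

Sorting makes repeated OCR runs deterministic, which helps when resuming a partially processed batch.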

Troubleshooting

Rate Limiting

  • Wait a few minutes between large crawls
  • Reduce --max-pages
  • Process sources one at a time
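Beyond those manual measures, rate-limit errors can be softened in code with exponential backoff. A generic sketch, with the actual Firecrawl request passed in as `fetch` (this helper is not part of the project's current code):

```python
import time

def with_backoff(fetch, retries: int = 3, base_delay: float = 1.0):
    """Call fetch(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2 ** attempt)
```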

Empty Files

  • Check errors.log in output directory
  • Try --mode scrape first to test structure
  • Verify URL is accessible

Integration with Opik

All crawl operations are traced via Opik for observability:

  • Track crawl duration and success rates
  • Monitor API usage and costs
  • Identify problematic pages
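To make the tracked metrics concrete, here is a stand-in (not the Opik API) showing the kind of per-crawl data those traces capture, namely duration and success/failure counts:

```python
import time

def timed_crawl(crawl_fn, stats: dict):
    """Run crawl_fn, recording duration and success/failure counts in stats."""
    start = time.perf_counter()
    try:
        result = crawl_fn()
        stats["success"] = stats.get("success", 0) + 1
        return result
    except Exception:
        stats["failure"] = stats.get("failure", 0) + 1
        raise
    finally:
        stats["last_duration_s"] = time.perf_counter() - start
```

With Opik this bookkeeping is handled by the tracing layer rather than hand-rolled dictionaries.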