Firecrawl Quick Reference

πŸš€ One-Time Setup

# 1. Set API key
export FIRECRAWL_API_KEY="your_key_here"

# 2. Install dependencies (if not done)
poetry install
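
The scripts pick the key up from this environment variable at runtime. A minimal sketch of the client setup, assuming the firecrawl-py SDK (the import and constructor below are assumptions about how the scripts authenticate, not an excerpt from them):

# Sketch: initialize a Firecrawl client from the exported key (assumes firecrawl-py)
import os
from firecrawl import FirecrawlApp

api_key = os.environ["FIRECRAWL_API_KEY"]  # set in step 1 above
app = FirecrawlApp(api_key=api_key)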

πŸ“ Common Commands​

Test Connection

poetry run python examples/simple_scrape.py

Dry Run (see what would happen)

poetry run python src/crawl_municipal_docs.py --dry-run

Scrape Single Page (Testing)

# Test one source
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode scrape

# Test all sources
poetry run python src/crawl_municipal_docs.py --source all --mode scrape
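
--mode scrape fetches only a source's entry page, which is useful for inspecting what Firecrawl returns before committing to a crawl. A hedged sketch of a single-page scrape with the firecrawl-py SDK (the domain is a placeholder; the path comes from the sources table below):

# Sketch: scrape one page and inspect the response (assumes firecrawl-py)
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
result = app.scrape_url("https://example.org/publications-arretes/")  # placeholder domain
print(result)  # review the markdown and metadata before running a full crawl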

Crawl Full Site (Production)

# One source, limited pages
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 50

# All sources, up to 100 pages each
poetry run python src/crawl_municipal_docs.py --source all --mode crawl --max-pages 100

# Large crawl (for the arrΓͺtΓ©s source, ~4010 documents)
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 500
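
--mode crawl follows links from the starting URL, and --max-pages caps how many pages are fetched. A hedged sketch of the equivalent SDK call (crawl_url exists in firecrawl-py, but the name of the page-limit parameter differs between SDK versions, so treat it as an assumption):

# Sketch: crawl a section with a page cap (assumes firecrawl-py; the limit keyword may vary by version)
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
crawl = app.crawl_url("https://example.org/publications-arretes/", limit=50)  # placeholder domain
print(crawl)  # the response lists the crawled pages and their content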

πŸ“Š Available Sources

Source Name             URL                                       Expected Count
mairie_arretes          publications-arretes/                     ~4010
mairie_deliberations    deliberations-conseil-municipal/          Unknown
commission_controle     documentheque/?documents_category=49      Unknown
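
Inside the crawl script, each source name maps to a URL path like the ones above. A hedged sketch of how such a registry could look (the variable name is an assumption; the names, paths, and counts come from the table):

# Sketch: source registry as the script might represent it
SOURCES = {
    "mairie_arretes": "publications-arretes/",                   # ~4010 documents expected
    "mairie_deliberations": "deliberations-conseil-municipal/",  # count unknown
    "commission_controle": "documentheque/?documents_category=49",  # count unknown
}
# The site's base domain is not shown in this reference, so it is omitted here.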

πŸ“‚ Output Locations

ext_data/
β”œβ”€β”€ mairie_arretes/ # ArrΓͺtΓ©s & publications
β”œβ”€β”€ mairie_deliberations/ # DΓ©libΓ©rations
└── commission_controle/ # Commission documents

Each directory contains:

  • *.md - Markdown content
  • *.html - HTML content
  • *_metadata.json - Full page metadata
  • index_*.md - Index of all pages
  • errors.log - Error log (if any)
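
To inspect a page's metadata programmatically, the JSON files can be read back with the standard library. A small sketch using the naming pattern listed above (the metadata fields themselves depend on what Firecrawl returned for each page):

# Sketch: read the first *_metadata.json in a source directory
import json
from pathlib import Path

out_dir = Path("ext_data/mairie_arretes")
for meta_file in sorted(out_dir.glob("*_metadata.json")):
    metadata = json.loads(meta_file.read_text(encoding="utf-8"))
    print(meta_file.name, "->", sorted(metadata.keys())[:5], "...")  # first few metadata keys
    break  # inspect just the first file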

πŸ” Checking Results​

# Count scraped files
ls ext_data/mairie_arretes/*.md | wc -l

# View index
cat ext_data/mairie_arretes/index_*.md

# Check for errors
cat ext_data/mairie_arretes/errors.log
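
The same checks can be scripted across all sources. A short sketch that reports the markdown count and error-log status per directory (paths follow the output layout above):

# Sketch: summarize results for every source directory under ext_data/
from pathlib import Path

for source_dir in sorted(Path("ext_data").iterdir()):
    if not source_dir.is_dir():
        continue
    pages = len(list(source_dir.glob("*.md")))  # includes the index_*.md file
    errors_log = source_dir / "errors.log"
    has_errors = errors_log.exists() and errors_log.read_text(encoding="utf-8").strip()
    print(f"{source_dir.name}: {pages} markdown files, errors.log {'present' if has_errors else 'empty or absent'}")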

βœ… Recommended Workflow

  1. Test API: poetry run python examples/simple_scrape.py
  2. Explore Structure: --mode scrape on each source
  3. Limited Crawl: --mode crawl --max-pages 10 to validate
  4. Full Crawl: Increase --max-pages based on needs
  5. Review Outputs: Check files in ext_data/

πŸ› οΈ Troubleshooting​

"Failed to initialize Firecrawl"​

"Rate limit exceeded"​

  • Wait a few minutes
  • Reduce --max-pages
  • Process sources one at a time
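
If rate-limit errors keep appearing even with a lower --max-pages, wrapping individual requests in a simple backoff loop helps. A hedged sketch (the SDK's exact exception classes are not documented here, so a generic exception is caught; the delays are arbitrary):

# Sketch: retry a scrape with exponential backoff (assumes firecrawl-py)
import os
import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

def scrape_with_retry(url, attempts=4, base_delay=10.0):
    for attempt in range(attempts):
        try:
            return app.scrape_url(url)
        except Exception as exc:  # the SDK's specific error types are an assumption
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Request failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)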

Empty or Missing Files

  • Check errors.log in output directory
  • Try --mode scrape first to test structure
  • Verify URL is accessible in browser

πŸ“š Full Documentation