Firecrawl Quick Reference
One-Time Setup
# 1. Set API key
export FIRECRAWL_API_KEY="your_key_here"
# 2. Install dependencies (if not done)
poetry install
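Before running any command, it can help to confirm that the key from step 1 is actually visible to Python. The following is a minimal sketch, not part of the project: `api_key_status` is a hypothetical helper that only reads the FIRECRAWL_API_KEY variable and never prints its value.

```python
import os


def api_key_status(env=None):
    """Classify the FIRECRAWL_API_KEY setting as 'missing', 'placeholder', or 'set'.

    Hypothetical helper for illustration; it does not contact Firecrawl,
    so a status of 'set' does not prove the key is valid.
    """
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        return "missing"
    # "your_key_here" is the placeholder used in the setup step above.
    if key == "your_key_here":
        return "placeholder"
    return "set"
```

Calling `api_key_status()` with no argument inspects the current environment; passing a dict makes the check easy to test.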
Common Commands
Test Connection
poetry run python examples/simple_scrape.py
Dry Run (see what would happen)
poetry run python src/crawl_municipal_docs.py --dry-run
Scrape Single Page (Testing)
# Test one source
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode scrape
# Test all sources
poetry run python src/crawl_municipal_docs.py --source all --mode scrape
Crawl Full Site (Production)
# One source, limited pages
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 50
# All sources, up to 100 pages each
poetry run python src/crawl_municipal_docs.py --source all --mode crawl --max-pages 100
# Large crawl (for arrêtés with 4010 documents)
poetry run python src/crawl_municipal_docs.py --source mairie_arretes --mode crawl --max-pages 500
Available Sources
| Source Name | URL | Expected Count |
|---|---|---|
| mairie_arretes | publications-arretes/ | ~4010 |
| mairie_deliberations | deliberations-conseil-municipal/ | Unknown |
| commission_controle | documentheque/?documents_category=49 | Unknown |
Output Locations
ext_data/
├── mairie_arretes/          # Arrêtés & publications
├── mairie_deliberations/    # Délibérations
└── commission_controle/     # Commission documents
Each directory contains:
- *.md: Markdown content
- *.html: HTML content
- *_metadata.json: Full page metadata
- index_*.md: Index of all pages
- errors.log: Error log (if any)
Checking Results
# Count scraped files
ls ext_data/mairie_arretes/*.md | wc -l
# View index
cat ext_data/mairie_arretes/index_*.md
# Check for errors
cat ext_data/mairie_arretes/errors.log
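The same checks can be done from Python when a summary across sources is more convenient than separate shell commands. This is a rough sketch under stated assumptions: `summarize_output` is a hypothetical helper, and it assumes index files match index_*.md and that errors.log holds one entry per line. Unlike the `ls | wc -l` count above, it reports page files and index files separately.

```python
from pathlib import Path


def summarize_output(source_dir):
    """Count page files, index files, and error-log lines in one output directory.

    Hypothetical helper; the file-name patterns are assumptions taken from
    the list of output files, not from the crawler's code.
    """
    root = Path(source_dir)
    markdown_files = list(root.glob("*.md"))
    indexes = [p for p in markdown_files if p.name.startswith("index_")]
    error_log = root / "errors.log"
    error_lines = 0
    if error_log.is_file():
        # Count non-empty lines only, assuming one error per line.
        text = error_log.read_text(encoding="utf-8", errors="replace")
        error_lines = sum(1 for line in text.splitlines() if line.strip())
    return {
        "pages": len(markdown_files) - len(indexes),
        "indexes": len(indexes),
        "error_lines": error_lines,
    }
```

For example, `summarize_output("ext_data/mairie_arretes")` returns a small dict that can be printed or compared between runs.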
Recommended Workflow
- Test API: poetry run python examples/simple_scrape.py
- Explore Structure: --mode scrape on each source
- Limited Crawl: --mode crawl --max-pages 10 to validate
- Full Crawl: Increase --max-pages based on needs
- Review Outputs: Check files in ext_data/
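The middle steps of this workflow can be written out as explicit command lines for one source. The sketch below only builds the argument lists and does not run anything. `workflow_commands` is a hypothetical helper, and it uses only the flags shown in this reference (--source, --mode, --max-pages).

```python
def workflow_commands(source, full_max_pages):
    """Return the scrape, limited-crawl, and full-crawl commands for one source.

    Hypothetical helper; it assumes the script path and flags are exactly
    those shown in the commands above.
    """
    base = [
        "poetry", "run", "python", "src/crawl_municipal_docs.py",
        "--source", source,
    ]
    return [
        # Explore Structure: single-page scrape.
        base + ["--mode", "scrape"],
        # Limited Crawl: validate with 10 pages.
        base + ["--mode", "crawl", "--max-pages", "10"],
        # Full Crawl: page limit chosen by the caller.
        base + ["--mode", "crawl", "--max-pages", str(full_max_pages)],
    ]
```

Each returned list can be passed to `subprocess.run`, or joined with spaces and pasted into a shell.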
Troubleshooting
"Failed to initialize Firecrawl"
- Check API key: echo $FIRECRAWL_API_KEY
- Get key from: https://firecrawl.dev
"Rate limit exceeded"
- Wait a few minutes
- Reduce --max-pages
- Process sources one at a time
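One way to combine these three suggestions is a small driver that crawls each source in turn with a modest page limit and pauses between runs. This is only a sketch: `crawl_one_at_a_time` is a hypothetical helper, the 180-second pause is an arbitrary choice rather than a documented Firecrawl limit, and the `run` and `sleep` parameters exist so the logic can be tested without starting a crawl.

```python
import subprocess
import time

# Source names taken from the Available Sources table.
SOURCES = ("mairie_arretes", "mairie_deliberations", "commission_controle")


def crawl_one_at_a_time(sources=SOURCES, max_pages=50, pause_seconds=180,
                        run=subprocess.run, sleep=time.sleep):
    """Crawl sources sequentially, pausing between runs; return exit codes by source."""
    results = {}
    for position, source in enumerate(sources):
        if position > 0:
            # Assumed cool-down between sources; adjust to your plan's limits.
            sleep(pause_seconds)
        command = [
            "poetry", "run", "python", "src/crawl_municipal_docs.py",
            "--source", source, "--mode", "crawl",
            "--max-pages", str(max_pages),
        ]
        results[source] = run(command).returncode
    return results
```

A non-zero exit code in the returned dict marks a source worth retrying later, ideally after checking its errors.log.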
Empty or Missing Files
- Check errors.log in output directory
- Try --mode scrape first to test structure
- Verify URL is accessible in browser
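To find which files came back empty, a short scan of the output directory may be quicker than opening files by hand. The sketch below is illustrative: `find_empty_files` is a hypothetical helper, and it treats only zero-byte files as empty, using the file patterns listed under Output Locations.

```python
from pathlib import Path


def find_empty_files(source_dir, patterns=("*.md", "*.html", "*_metadata.json")):
    """Return names of zero-byte files in source_dir, grouped by pattern order.

    Hypothetical helper; files that contain only whitespace are not flagged.
    """
    root = Path(source_dir)
    empty = []
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            if path.is_file() and path.stat().st_size == 0:
                empty.append(path.name)
    return empty
```

For example, `find_empty_files("ext_data/mairie_deliberations")` returns a list of file names to cross-check against errors.log.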
Full Documentation
- Complete Guide: FIRECRAWL_GUIDE.md
- Examples: examples/
- Configuration: src/config.py