Mockup System - Contribution Testing Framework
Overview
The mockup system generates controlled variations of citizen contributions to systematically test and improve Forseti 461's charter validation. By creating progressive mutations from valid to invalid contributions, we can:
- Identify prompt weaknesses - Find cases where violations slip through
- Measure consistency - Ensure similar inputs produce similar outputs
- Build training datasets - Create Opik-compatible datasets for prompt optimization
- Track accuracy over time - Compare validation results across prompt iterations
Why Mutation Testing?
Charter validation is a nuanced task. A contribution might be:
- Clearly valid (constructive, local, factual)
- Clearly invalid (personal attack, spam, off-topic)
- Borderline (subtle violations, mixed content)
The challenge: LLM-based validation can miss subtle violations or flag valid content incorrectly.
The solution: Generate controlled mutations with known expected outcomes, then measure how well Forseti detects them.
Valid Contribution ──┬── Small mutation (95% similar) → Should still be valid
├── Medium mutation (80% similar) → Borderline
├── Large mutation (60% similar) → Likely invalid
└── Violation injected → Must be detected as invalid
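The mutation ladder above can be sketched with a simple similarity check. This is a minimal illustration using `difflib`; the thresholds are taken from the diagram, not tuned values from the real pipeline:

```python
from difflib import SequenceMatcher

def expected_outcome(original: str, mutated: str) -> str:
    """Map a mutation's similarity ratio to the expected validation outcome."""
    ratio = SequenceMatcher(None, original, mutated).ratio()
    if ratio >= 0.95:
        return "valid"          # small mutation: should still pass
    if ratio >= 0.80:
        return "borderline"     # medium mutation: needs human judgment
    return "likely_invalid"     # large mutation: probably rejected
```

Injected violations bypass this similarity heuristic entirely: they carry an explicit `expected_valid: false` ground truth regardless of how similar the text remains.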
Contribution Format (Framaforms)
All mockup contributions follow the Audierne2026 Framaforms submission format:
{
"id": "unique-contribution-id",
"category": "economie",
"constat_factuel": "Le parking du port est souvent plein en été...",
"idees_ameliorations": "Créer un parking relais à l'entrée de la ville...",
"source": "framaforms",
"expected_valid": true
}
| Field | Description |
|---|---|
| `category` | One of 7 categories: `economie`, `logement`, `culture`, `ecologie`, `associations`, `jeunesse`, `alimentation-bien-etre-soins` |
| `constat_factuel` | Factual observation about the current situation |
| `idees_ameliorations` | Proposed improvements or solutions |
| `source` | Origin: `framaforms` (real), `mock` (synthetic), `derived` (mutated), `input` (Auto-Contribution tab) |
| `expected_valid` | Ground truth for testing (`null` if unknown) |
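A lightweight schema check for this format might look like the following (a hypothetical helper for illustration, not part of `app.mockup`):

```python
CATEGORIES = {
    "economie", "logement", "culture", "ecologie",
    "associations", "jeunesse", "alimentation-bien-etre-soins",
}
SOURCES = {"framaforms", "mock", "derived", "input"}

def check_contribution(c: dict) -> list[str]:
    """Return a list of schema problems (empty list means well-formed)."""
    errors = []
    for field in ("id", "category", "constat_factuel", "idees_ameliorations", "source"):
        if not c.get(field):
            errors.append(f"missing field: {field}")
    if c.get("category") not in CATEGORIES:
        errors.append(f"unknown category: {c.get('category')}")
    if c.get("source") not in SOURCES:
        errors.append(f"unknown source: {c.get('source')}")
    if not isinstance(c.get("expected_valid"), (bool, type(None))):
        errors.append("expected_valid must be true, false, or null")
    return errors
```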
Source Types
| Source | Origin | Forseti Validated | Use Case |
|---|---|---|---|
| `framaforms` | Real citizen submissions | On demand | Ground truth baseline |
| `mock` | Manually created test data | On demand | Known violation patterns |
| `derived` | LLM-mutated from base | On demand | Variation testing |
| `input` | Auto-Contribution tab (user-created) | Pre-save | Real user contributions |
The input source is unique: contributions are validated by Forseti 461 before saving, ensuring all user-created contributions have validation metadata attached.
Mutation Strategies
The mockup system supports two mutation strategies:
1. Text-Based Mutations (Levenshtein)
Uses Levenshtein distance for controlled character-level variations:
from app.mockup import levenshtein_ratio, apply_distance
original = "Le port d'Audierne est magnifique"
mutated, distance = apply_distance(original, target_distance=5)
# Result: "Le port d'Audirne est magnifque" (typos introduced)
Best for:
- Simulating typing errors
- Testing OCR-like corruption
- Fast, deterministic mutations
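For reference, the distance underlying `levenshtein_ratio` can be computed with the classic dynamic-programming algorithm. This is a standalone sketch; `app.mockup` ships its own implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[len(b)]

def levenshtein_ratio_sketch(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).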
2. LLM-Based Mutations (Ollama/Mistral)
Uses a local LLM to generate semantic variations:
from app.mockup import generate_variations
# Generate variations using Mistral
variations = generate_variations(
constat_factuel="Le parking du port est souvent plein...",
idees_ameliorations="Créer un parking relais...",
num_variations=5,
include_violations=True,
use_llm=True, # Use Ollama/Mistral
llm_model="mistral:latest",
)
Mutation Types (LLM):
| Type | Description | Expected Valid |
|---|---|---|
paraphrase | Same meaning, different words | ✅ Yes |
orthographic | Realistic typos and errors | ✅ Yes |
semantic_shift | Slightly different meaning | ⚠️ Borderline |
subtle_violation | Hidden charter violation | ❌ No |
aggressive | Obvious violation (attacks, caps) | ❌ No |
off_topic | Drifts to unrelated subjects | ❌ No |
Requirements:
- Ollama running locally (`ollama serve`)
- Mistral model pulled (`ollama pull mistral`)
Choosing a Strategy
| Criterion | Text (Levenshtein) | LLM (Mistral) |
|---|---|---|
| Speed | Fast | Slower |
| Realism | Low | High |
| Semantic understanding | None | Full |
| Requires GPU/Ollama | No | Yes |
| Deterministic | Yes | No |
Recommendation: Use LLM mutations for realistic testing, text mutations for quick iteration.
Combined Approach
For comprehensive testing, use both:
# Quick text-based mutations for volume
text_variations = generate_variations(text, num_variations=20, use_llm=False)
# High-quality LLM mutations for edge cases
llm_variations = generate_variations(text, num_variations=5, use_llm=True)
Violation Categories
VIOLATION_PATTERNS = {
"personal_attack": [
"Le maire est incompétent",
"Cette personne ne comprend rien",
],
"off_topic": [
"D'ailleurs, parlons de la politique nationale",
"Sans rapport avec Audierne, mais...",
],
"non_constructive": [
"C'est nul, point final",
"Rien ne marchera jamais ici",
],
"aggressive": [
"Bande d'incapables !",
"Vous êtes tous corrompus",
],
}
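Injecting one of these patterns into an otherwise valid contribution could be sketched as follows (hypothetical helper; the real generator may place or combine patterns differently):

```python
import random

VIOLATION_PATTERNS = {
    "personal_attack": ["Le maire est incompétent"],
    "off_topic": ["Sans rapport avec Audierne, mais..."],
}

def inject_violation(text: str, violation_type: str, rng: random.Random) -> tuple[str, list[str]]:
    """Append a known violation snippet and record which type was injected."""
    snippet = rng.choice(VIOLATION_PATTERNS[violation_type])
    return f"{text} {snippet}", [violation_type]
```

Recording the injected type in `violations_injected` is what later lets the metrics compare Forseti's verdict against a known ground truth.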
JSON Input Format
Base Contributions (contributions.json)
Located at app/mockup/data/contributions.json:
{
"contributions": [
{
"id": "framaforms-eco-001",
"category": "economie",
"constat_factuel": "Les commerces du centre-ville souffrent...",
"idees_ameliorations": "Je propose de créer un programme...",
"source": "framaforms",
"expected_valid": true
},
{
"id": "mock-invalid-001",
"category": "culture",
"constat_factuel": "Le maire est un idiot qui ne fait rien.",
"idees_ameliorations": "Il faut le virer immédiatement.",
"source": "mock",
"expected_valid": false,
"violations_injected": ["personal_attack", "aggressive"]
}
]
}
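Loading the base file and splitting it by ground truth might look like this (illustrative only; `load_contributions` in `app.mockup` is the real entry point):

```python
import json
from pathlib import Path

def load_base(path: str) -> list[dict]:
    """Read the contributions.json wrapper and return the record list."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["contributions"]

def split_by_ground_truth(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate known-valid from known-invalid records (ignores null ground truth)."""
    valid = [r for r in records if r.get("expected_valid") is True]
    invalid = [r for r in records if r.get("expected_valid") is False]
    return valid, invalid
```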
AI-Generated Contributions
Use an LLM to generate realistic contributions:
prompt = """
Generate 5 citizen contributions for Audierne in JSON format.
Mix valid and invalid examples. Include:
- 3 valid, constructive proposals
- 1 with subtle off-topic content
- 1 with mild personal criticism
Format:
{
"contributions": [
{
"category": "economie|logement|culture|ecologie|associations|jeunesse|alimentation-bien-etre-soins",
"constat_factuel": "Factual observation about current situation",
"idees_ameliorations": "Proposed improvements",
"expected_valid": true|false,
"violations_injected": ["type"] // if invalid
}
]
}
"""
Storage Architecture
Storage Priority
The system uses a Redis-first approach with JSON fallback:
Load Order:
1. Redis storage (get_latest_validations) ← Primary
2. JSON file (contributions.json) ← Fallback if Redis empty
This ensures:
- User contributions from the Auto-Contribution tab are immediately visible
- Field Input-generated contributions are persisted
- Local development works without Redis (JSON fallback)
Redis Keys
Validation results are stored in Redis for historical analysis:
contribution_mockup:forseti461:charter:{date}:{id}
Example:
contribution_mockup:forseti461:charter:2026-01-26:framaforms-eco-001
contribution_mockup:forseti461:charter:2026-01-29:input_abc123def456 ← From Auto-Contribution tab
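Building and parsing these keys is straightforward (a sketch of the convention shown above):

```python
REDIS_KEY_PREFIX = "contribution_mockup:forseti461:charter"

def validation_key(date_str: str, contribution_id: str) -> str:
    """Compose the Redis key for one validation result."""
    return f"{REDIS_KEY_PREFIX}:{date_str}:{contribution_id}"

def parse_validation_key(key: str) -> tuple[str, str]:
    """Recover (date, contribution id) from a key.

    Assumes the id itself contains no colons, as in the examples above.
    """
    *_, date_str, contribution_id = key.split(":")
    return date_str, contribution_id
```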
Data Structure
{
"id": "framaforms-eco-001",
"date": "2026-01-26",
"title": "Generated from constat_factuel",
"body": "Combined constat + idees",
"category": "economie",
"constat_factuel": "...",
"idees_ameliorations": "...",
"is_valid": true,
"violations": [],
"encouraged_aspects": ["Proposition concrète"],
"confidence": 0.92,
"reasoning": "...",
"source": "framaforms",
"expected_valid": true,
"execution_time_ms": 1250,
"trace_id": "opik-trace-abc"
}
Opik Integration
Dataset Format
Validation records export to Opik-compatible format for prompt optimization:
{
"input": {
"title": "...",
"body": "...",
"category": "economie",
"constat_factuel": "...",
"idees_ameliorations": "..."
},
"expected_output": {
"is_valid": true,
"violations": [],
"encouraged_aspects": ["..."],
"confidence": 0.92,
"reasoning": "...",
"category": "economie"
},
"metadata": {
"id": "...",
"date": "2026-01-26",
"source": "framaforms",
"expected_valid": true,
"violations_injected": []
}
}
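Converting a stored validation record into this dataset shape can be sketched as a plain field mapping (hypothetical helper; the real exporter lives in `app/mockup/dataset.py`):

```python
def to_opik_item(record: dict) -> dict:
    """Map a Redis validation record to the Opik dataset item layout."""
    return {
        "input": {
            "title": record["title"],
            "body": record["body"],
            "category": record["category"],
            "constat_factuel": record["constat_factuel"],
            "idees_ameliorations": record["idees_ameliorations"],
        },
        "expected_output": {
            "is_valid": record["is_valid"],
            "violations": record["violations"],
            "encouraged_aspects": record["encouraged_aspects"],
            "confidence": record["confidence"],
            "reasoning": record["reasoning"],
            "category": record["category"],
        },
        "metadata": {
            "id": record["id"],
            "date": record["date"],
            "source": record["source"],
            "expected_valid": record["expected_valid"],
            # Not every record carries injected violations, so default to []
            "violations_injected": record.get("violations_injected", []),
        },
    }
```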
Optimization Workflow
1. Generate/load contributions
↓
2. Run batch validation with Forseti
↓
3. Store results in Redis
↓
4. Export to Opik dataset
↓
5. Create train/val/test split
↓
6. Run daily Opik experiment (track metrics)
↓
7. Run Opik optimizer
↓
8. Update Forseti prompts
↓
9. Repeat with new test set
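Step 5's train/val/test split can be done reproducibly with a seeded shuffle (an illustrative helper; the ratios are examples, not the pipeline's actual settings):

```python
import random

def split_dataset(items: list, train: float = 0.7, val: float = 0.15, seed: int = 42):
    """Shuffle deterministically, then cut into train/val/test partitions."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

A fixed seed matters here: step 9 calls for a fresh test set per iteration, so any reuse of the old split must be deliberate and reproducible.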
Daily Experiments
The mockup processor supports Opik experiments for tracking Forseti's performance over time:
from app.processors import MockupProcessor
processor = MockupProcessor()
# Run daily experiment
result = await processor.run_daily_experiment(
validate_func=forseti.validate,
source_filter=["framaforms", "mock"],
)
print(f"Accuracy: {result.charter_accuracy:.1%}")
print(f"F1 Score: {result.f1_score:.2f}")
print(f"False Negatives: {result.false_negatives}") # Missed violations!
Custom Metrics
Three charter-specific metrics are tracked:
| Metric | Description | Goal |
|---|---|---|
| `charter_accuracy` | Match between Forseti result and expected | > 95% |
| `violation_detection` | Detecting injected violations | > 98% |
| `confidence_calibration` | High confidence = correct prediction | > 0.8 |
Confusion Matrix
For charter validation, we track:
- True Positive (TP): Invalid contribution correctly rejected ✅
- True Negative (TN): Valid contribution correctly accepted ✅
- False Positive (FP): Valid contribution incorrectly rejected ⚠️
- False Negative (FN): Invalid contribution incorrectly accepted ❌ (worst!)
Key goal: Minimize False Negatives - a missed charter violation reaching the platform is worse than incorrectly flagging a valid contribution.
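Counting the four outcomes over a batch of validated records, and deriving precision/recall from them, might look like this (note that "positive" here means a contribution was flagged invalid):

```python
def confusion_counts(records: list[dict]) -> dict:
    """Tally TP/TN/FP/FN where 'positive' = rejected as invalid."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for r in records:
        actually_invalid = r["expected_valid"] is False
        flagged_invalid = r["is_valid"] is False
        if actually_invalid and flagged_invalid:
            counts["tp"] += 1
        elif not actually_invalid and not flagged_invalid:
            counts["tn"] += 1
        elif not actually_invalid and flagged_invalid:
            counts["fp"] += 1
        else:  # missed violation -- the case to minimize
            counts["fn"] += 1
    return counts

def precision_recall(c: dict) -> tuple[float, float]:
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    return precision, recall
```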
Supported Optimizers
- FewShotBayesianOptimizer - Selects best examples for few-shot prompts
- MetaPromptOptimizer - Uses LLM to generate/refine prompts
- MiproOptimizer - DSPy-based optimization
- EvolutionaryOptimizer - Genetic algorithm for prompt evolution
Usage
Streamlit UI
Access via the Mockup tab (?tab=mockup):
- Load Existing - Load contributions from Redis (fallback: JSON file)
- Generate Variations - Create Levenshtein mutations
- Single Contribution - Test one contribution manually
- Field Input - Generate from reports/docs
- Storage & Opik - View statistics, export datasets
Auto-Contribution Tab - Contribution Assistant
The Auto-Contribution tab (?tab=autocontrib) provides a user-friendly 5-step workflow for citizens to create charter-compliant contributions:
📚 Source Selection → 🏷️ Category → ✨ AI Draft → ✏️ Edit → 🔍 Validate & Save
↓ ↓ ↓ ↓ ↓
(Audierne docs (7 categories) (LLM generates (User edits (Forseti 461
or paste text) draft) fields) validates)
↓
Store in Redis
with validation results
5-Step Workflow:
| Step | Name | Description |
|---|---|---|
| 1 | Source | Select inspiration from Audierne2026 docs or paste custom text |
| 2 | Category | Choose one of 7 categories (economie, logement, culture, etc.) |
| 3 | Inspiration | AI generates draft constat_factuel + idees_ameliorations |
| 4 | Edit | User modifies both fields in editable text areas |
| 5 | Save | Forseti 461 validates, then saves to Redis with results |
Key Features:
- Bilingual Support - UI and AI drafts in French or English
- AI-Assisted Drafting - LLM generates contextual draft based on source document
- Pre-Save Validation - Forseti 461 validates before storing
- Full Traceability - Validation results (is_valid, violations, confidence) stored with contribution
Storage Format:
Contributions from the Auto-Contribution tab are stored with `source: "input"` to distinguish them from mockup-generated data:
{
"id": "input_abc123def456",
"source": "input",
"category": "economie",
"constat_factuel": "Le parking du port est souvent saturé...",
"idees_ameliorations": "Créer un parking relais à l'entrée...",
"is_valid": true,
"violations": [],
"encouraged_aspects": ["Proposition concrète", "Ancrage local"],
"confidence": 0.92,
"reasoning": "Contribution constructive et locale...",
"provider": "gemini",
"model": "gemini-2.5-flash"
}
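The pre-save flow can be sketched as: validate first, then persist the contribution together with the validation fields. The helper names and the `input_` id scheme below are assumptions inferred from the example record; the real logic lives in `workflow_autocontribution.py`:

```python
import uuid

def save_user_contribution(contribution: dict, validate_func, store_func) -> dict:
    """Validate with Forseti before storing, attaching the results to the record."""
    # validate_func stands in for Forseti 461, returning e.g.
    # {"is_valid": ..., "violations": [...], "confidence": ...}
    verdict = validate_func(contribution)
    record = {
        "id": "input_" + uuid.uuid4().hex[:12],  # assumed id scheme matching "input_abc123def456"
        "source": "input",
        **contribution,
        **verdict,
    }
    store_func(record)  # e.g. write to Redis
    return record
```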
Integration with Mockup Tab:
Contributions created via the Auto-Contribution tab appear in the Mockup tab's "Load Existing" view:
- Mockup tab first loads from Redis storage
- Falls back to JSON file if Redis is empty
- Filter by `source: "input"` to see user-created contributions
- Run batch validation or export to Opik datasets
Field Input Workflow
Generate themed contributions from real municipal data:
📋 Field Input → 📦 Chunking → 🔍 Theme Extraction → 🏷️ Category → 📝 Contribution Generation
↓ ↓ ↓ ↓ ↓
(90k+ chars) (15k chunks) (per chunk) (7 categories) (valid + violations)
↓ ↓ ↓
(500 overlap) (deduplicate) Store in Redis/JSON
↓
Run Opik Experiment
Document Chunking
Large documents (like public hearing transcripts, council minutes, etc.) are automatically chunked for LLM processing:
| Parameter | Value | Purpose |
|---|---|---|
| `CHUNK_SIZE` | 15,000 chars | Max text per LLM call |
| `CHUNK_OVERLAP` | 500 chars | Context continuity between chunks |
| Max input | ~100k chars | Handles full council minutes |
Processing Flow:
- Input text split into ~15k character chunks (preserving word boundaries)
- Each chunk sent to LLM for theme extraction
- Themes deduplicated across all chunks (same theme from different chunks merged)
- Contributions generated from unique themes
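The chunking step can be sketched as follows. This is an illustrative implementation of size/overlap splitting on word boundaries, mirroring the parameters in the table above, not the module's actual code:

```python
def chunk_text(text: str, size: int = 15000, overlap: int = 500) -> list[str]:
    """Split text into <= size chunks, breaking at spaces, with overlap for context."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by the overlap so adjacent chunks share context
        start = max(end - overlap, start + 1)
    return chunks
```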
Available Input Sources:
- Audierne2026 Docs - Select from `docs/docs/audierne2026/` markdown files
- Paste Text - Directly paste content from reports, speeches, etc.
- Upload File - Upload markdown or text files
Example:
from app.mockup import process_field_input_sync
# Process a public hearing report
result = process_field_input_sync(
input_text=open("rapport_audience.md").read(),
source_title="Audience publique - Rénovation École",
contributions_per_theme=2,
include_violations=True,
)
print(f"Extracted {result.themes_extracted} themes")
print(f"Generated {result.contributions_generated} contributions")
print(f"Categories: {result.categories_covered}")
Daily Experiment Workflow:
- Inject field data through UI (report, speech, etc.)
- LLM extracts themes across 7 categories
- Generate contributions (valid + violations)
- Save to Redis
- Run Opik experiment with current Forseti prompt
- Track accuracy over time
Programmatic API
from app.mockup import (
load_contributions,
generate_variations,
get_storage,
get_dataset_manager,
)
# Load base contributions
generator = load_contributions()
# Generate variations with violations
variations = generate_variations(
constat_factuel="Le port est souvent saturé...",
idees_ameliorations="Créer un parking relais...",
category="economie",
num_variations=5,
include_violations=True,
)
# After validation, export to Opik
manager = get_dataset_manager()
manager.create_charter_dataset("forseti-charter-v2")
manager.add_from_redis(date_str="2026-01-26")
manager.sync_to_opik()
Metrics
Accuracy Metrics
| Metric | Formula | Target |
|---|---|---|
| Precision | TP / (TP + FP) | > 95% |
| Recall | TP / (TP + FN) | > 98% |
| F1 Score | 2 × (P × R) / (P + R) | > 96% |
Where:
- TP = Invalid contribution correctly rejected
- FP = Valid contribution incorrectly rejected
- FN = Invalid contribution incorrectly accepted (most dangerous)
Key Goal
Minimize False Negatives: A missed charter violation reaching the platform is worse than incorrectly flagging a valid contribution (which can be manually reviewed).
Logging & Debugging
The mockup system uses dedicated logging for debugging and monitoring:
Log Files:
- `logs/mockup.log` - All mockup operations (DEBUG level)
- `logs/mockup_errors.log` - Error-only log
Using MockupLogger:
from app.services.logging import MockupLogger
logger = MockupLogger("field_input_generator")
# Structured logging with kwargs
logger.info("PROCESS_START", source="direct_input", length=90034)
logger.info("CHUNKING", total_length=90034, chunks=7)
logger.debug("LLM_RESPONSE", chunk=0, length=424)
logger.info("THEMES_EXTRACTED", total=15, unique=12)
Common Log Events:
| Event | Level | Description |
|---|---|---|
| `PROCESS_START` | INFO | Field input processing started |
| `CHUNKING` | INFO | Document split into chunks |
| `LLM_RESPONSE` | DEBUG | Raw LLM response per chunk |
| `CHUNK_THEMES` | INFO | Themes found in each chunk |
| `THEMES_EXTRACTED` | INFO | Final unique themes after dedup |
| `CONTRIBUTIONS_SAVED` | INFO | Contributions stored to Redis/JSON |
| `PROCESS_COMPLETE` | INFO | Full processing summary |
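The structured kwargs style shown above can be approximated with a thin wrapper over the standard `logging` module. This is a sketch only; the real `MockupLogger` lives in `app/services/logging` and may format events differently:

```python
import logging

class StructuredLogger:
    """Minimal kwargs-to-text structured logger, modeled on the MockupLogger usage above."""

    def __init__(self, name: str):
        self._log = logging.getLogger(f"mockup.{name}")

    @staticmethod
    def format_event(event: str, **fields) -> str:
        # Render "EVENT key1=val1 key2=val2" in the order kwargs were passed
        pairs = " ".join(f"{k}={v}" for k, v in fields.items())
        return f"{event} {pairs}".strip()

    def info(self, event: str, **fields):
        self._log.info(self.format_event(event, **fields))

    def debug(self, event: str, **fields):
        self._log.debug(self.format_event(event, **fields))
```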
See Logging System for full documentation.
Best Practices
- Diverse test sets - Include all 7 categories and violation types
- Progressive difficulty - Test from obvious to subtle violations
- Real data baseline - Include actual Framaforms submissions when available
- Regular re-testing - Run tests after any prompt changes
- Track over time - Compare accuracy across prompt versions
File Structure
app/mockup/
├── __init__.py # Module exports
├── generator.py # MockContribution, ContributionGenerator
├── levenshtein.py # Distance calculations, text mutations
├── llm_mutations.py # LLM-based semantic mutations (Ollama)
├── field_input.py # Field input processing (reports → contributions)
├── storage.py # Redis storage, ValidationRecord
├── dataset.py # Opik dataset management
├── batch_view.py # Streamlit UI (5 modes)
└── data/
├── contributions.json # Generated test contributions (fallback)
└── category_themes.json # Category themes for field input
app/auto_contribution/
├── __init__.py # Module exports
└── views.py # 5-step Streamlit workflow UI
app/processors/workflows/
├── __init__.py # Workflow exports
└── workflow_autocontribution.py # Business logic (ContributionAssistant, 5 steps)
app/translations/
├── fr.json # French translations (autocontrib_* keys)
└── en.json # English translations (autocontrib_* keys)