
Mockup System - Contribution Testing Framework

Overview

The mockup system generates controlled variations of citizen contributions to systematically test and improve Forseti 461's charter validation. By creating progressive mutations from valid to invalid contributions, we can:

  1. Identify prompt weaknesses - Find cases where violations slip through
  2. Measure consistency - Ensure similar inputs produce similar outputs
  3. Build training datasets - Create Opik-compatible datasets for prompt optimization
  4. Track accuracy over time - Compare validation results across prompt iterations

Why Mutation Testing?

Charter validation is a nuanced task. A contribution might be:

  • Clearly valid (constructive, local, factual)
  • Clearly invalid (personal attack, spam, off-topic)
  • Borderline (subtle violations, mixed content)

The challenge: LLM-based validation can miss subtle violations or flag valid content incorrectly.

The solution: Generate controlled mutations with known expected outcomes, then measure how well Forseti detects them.

Valid Contribution ──┬── Small mutation (95% similar)  → Should still be valid
                     ├── Medium mutation (80% similar) → Borderline
                     ├── Large mutation (60% similar)  → Likely invalid
                     └── Violation injected            → Must be detected as invalid

Contribution Format (Framaforms)

All mockup contributions follow the Audierne2026 Framaforms submission format:

{
  "id": "unique-contribution-id",
  "category": "economie",
  "constat_factuel": "Le parking du port est souvent plein en été...",
  "idees_ameliorations": "Créer un parking relais à l'entrée de la ville...",
  "source": "framaforms",
  "expected_valid": true
}
| Field | Description |
|---|---|
| category | One of 7 categories: economie, logement, culture, ecologie, associations, jeunesse, alimentation-bien-etre-soins |
| constat_factuel | Factual observation about the current situation |
| idees_ameliorations | Proposed improvements or solutions |
| source | Origin: framaforms (real), mock (synthetic), derived (mutated), input (Auto-Contribution tab) |
| expected_valid | Ground truth for testing (null if unknown) |

Source Types

| Source | Origin | Forseti Validated | Use Case |
|---|---|---|---|
| framaforms | Real citizen submissions | On demand | Ground truth baseline |
| mock | Manually created test data | On demand | Known violation patterns |
| derived | LLM-mutated from base | On demand | Variation testing |
| input | Questions tab (user-created) | Pre-save | Real user contributions |

The input source is unique: contributions are validated by Forseti 461 before saving, ensuring all user-created contributions have validation metadata attached.

Mutation Strategies

The mockup system supports two mutation strategies:

1. Text-Based Mutations (Levenshtein)

Uses Levenshtein distance for controlled character-level variations:

from app.mockup import levenshtein_ratio, apply_distance

original = "Le port d'Audierne est magnifique"
mutated, distance = apply_distance(original, target_distance=5)
# Result: "Le port d'Audirne est magnifque" (typos introduced)

Best for:

  • Simulating typing errors
  • Testing OCR-like corruption
  • Fast, deterministic mutations
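For intuition, `levenshtein_ratio` boils down to standard edit distance. A minimal, self-contained sketch (the real helper lives in `app.mockup` and may differ in detail):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """1.0 for identical strings, lower as edits accumulate."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))
```

A `target_distance=5` mutation, as in the example above, simply aims for a mutated string whose `levenshtein_distance` from the original is 5.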

2. LLM-Based Mutations (Ollama/Mistral)

Uses a local LLM to generate semantic variations:

from app.mockup import generate_variations

# Generate variations using Mistral
variations = generate_variations(
    constat_factuel="Le parking du port est souvent plein...",
    idees_ameliorations="Créer un parking relais...",
    num_variations=5,
    include_violations=True,
    use_llm=True,  # Use Ollama/Mistral
    llm_model="mistral:latest",
)

Mutation Types (LLM):

| Type | Description | Expected Valid |
|---|---|---|
| paraphrase | Same meaning, different words | ✅ Yes |
| orthographic | Realistic typos and errors | ✅ Yes |
| semantic_shift | Slightly different meaning | ⚠️ Borderline |
| subtle_violation | Hidden charter violation | ❌ No |
| aggressive | Obvious violation (attacks, caps) | ❌ No |
| off_topic | Drifts to unrelated subjects | ❌ No |

Requirements:

  • Ollama running locally (ollama serve)
  • Mistral model pulled (ollama pull mistral)

Choosing a Strategy

| Criterion | Text (Levenshtein) | LLM (Mistral) |
|---|---|---|
| Speed | Fast | Slower |
| Realism | Low | High |
| Semantic understanding | None | Full |
| Requires GPU/Ollama | No | Yes |
| Deterministic | Yes | No |

Recommendation: Use LLM mutations for realistic testing, text mutations for quick iteration.

Combined Approach

For comprehensive testing, use both:

# Quick text-based mutations for volume
text_variations = generate_variations(text, num_variations=20, use_llm=False)

# High-quality LLM mutations for edge cases
llm_variations = generate_variations(text, num_variations=5, use_llm=True)

Violation Categories

VIOLATION_PATTERNS = {
    "personal_attack": [
        "Le maire est incompétent",
        "Cette personne ne comprend rien",
    ],
    "off_topic": [
        "D'ailleurs, parlons de la politique nationale",
        "Sans rapport avec Audierne, mais...",
    ],
    "non_constructive": [
        "C'est nul, point final",
        "Rien ne marchera jamais ici",
    ],
    "aggressive": [
        "Bande d'incapables !",
        "Vous êtes tous corrompus",
    ],
}
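A hedged sketch of how such a pattern can be injected into a valid base contribution (the helper name and exact field handling are illustrative, not the actual `app.mockup` API — shown here with a trimmed copy of the patterns):

```python
import random

VIOLATION_PATTERNS = {
    "personal_attack": ["Le maire est incompétent"],
    "off_topic": ["Sans rapport avec Audierne, mais..."],
}


def inject_violation(contribution: dict, violation_type: str,
                     rng: random.Random) -> dict:
    """Return a copy of the contribution with a violation phrase appended
    and the ground-truth fields updated accordingly."""
    phrase = rng.choice(VIOLATION_PATTERNS[violation_type])
    mutated = dict(contribution)  # shallow copy; base stays untouched
    mutated["constat_factuel"] = f"{contribution['constat_factuel']} {phrase}"
    mutated["source"] = "derived"
    mutated["expected_valid"] = False
    mutated["violations_injected"] = [violation_type]
    return mutated
```

Keeping `expected_valid` and `violations_injected` in sync with the mutation is what makes the record usable as ground truth later.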

JSON Input Format

Base Contributions (contributions.json)

Located at app/mockup/data/contributions.json:

{
  "contributions": [
    {
      "id": "framaforms-eco-001",
      "category": "economie",
      "constat_factuel": "Les commerces du centre-ville souffrent...",
      "idees_ameliorations": "Je propose de créer un programme...",
      "source": "framaforms",
      "expected_valid": true
    },
    {
      "id": "mock-invalid-001",
      "category": "culture",
      "constat_factuel": "Le maire est un idiot qui ne fait rien.",
      "idees_ameliorations": "Il faut le virer immédiatement.",
      "source": "mock",
      "expected_valid": false,
      "violations_injected": ["personal_attack", "aggressive"]
    }
  ]
}

AI-Generated Contributions

Use an LLM to generate realistic contributions:

prompt = """
Generate 5 citizen contributions for Audierne in JSON format.
Mix valid and invalid examples. Include:
- 3 valid, constructive proposals
- 1 with subtle off-topic content
- 1 with mild personal criticism

Format:
{
  "contributions": [
    {
      "category": "economie|logement|culture|ecologie|associations|jeunesse|alimentation-bien-etre-soins",
      "constat_factuel": "Factual observation about current situation",
      "idees_ameliorations": "Proposed improvements",
      "expected_valid": true|false,
      "violations_injected": ["type"]  // if invalid
    }
  ]
}
"""

Storage Architecture

Storage Priority

The system uses a Redis-first approach with JSON fallback:

Load Order:
1. Redis storage (get_latest_validations) ← Primary
2. JSON file (contributions.json) ← Fallback if Redis empty

This ensures:

  • User contributions from Questions tab are immediately visible
  • Field Input generated contributions are persisted
  • Local development works without Redis (JSON fallback)
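The load order can be sketched as a small helper (names and error handling are illustrative; the real logic lives in the `app.mockup` storage layer):

```python
def load_with_fallback(load_from_redis, load_from_json):
    """Redis-first loading: return Redis records if any exist,
    otherwise fall back to the JSON file. Returns (records, origin)."""
    try:
        records = load_from_redis()
    except ConnectionError:
        records = []  # Redis unavailable: fall through to JSON
    if records:
        return records, "redis"
    return load_from_json(), "json"
```

Passing the two loaders as callables keeps the fallback policy testable without a running Redis instance.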

Redis Keys

Validation results are stored in Redis for historical analysis:

contribution_mockup:forseti461:charter:{date}:{id}

Example:

contribution_mockup:forseti461:charter:2026-01-26:framaforms-eco-001
contribution_mockup:forseti461:charter:2026-01-29:input_abc123def456 ← From Questions tab
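Composing and parsing these keys is a plain string operation. A sketch (helper names hypothetical; assumes the contribution id itself contains no `:`):

```python
KEY_PREFIX = "contribution_mockup:forseti461:charter"


def validation_key(date: str, contribution_id: str) -> str:
    """Compose the Redis key for one validation result."""
    return f"{KEY_PREFIX}:{date}:{contribution_id}"


def parse_validation_key(key: str) -> tuple[str, str]:
    """Split a key back into (date, contribution_id)."""
    prefix, date, contribution_id = key.rsplit(":", 2)
    if prefix != KEY_PREFIX:
        raise ValueError(f"unexpected key: {key}")
    return date, contribution_id
```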

Data Structure

{
  "id": "framaforms-eco-001",
  "date": "2026-01-26",
  "title": "Generated from constat_factuel",
  "body": "Combined constat + idees",
  "category": "economie",
  "constat_factuel": "...",
  "idees_ameliorations": "...",

  "is_valid": true,
  "violations": [],
  "encouraged_aspects": ["Proposition concrète"],
  "confidence": 0.92,
  "reasoning": "...",

  "source": "framaforms",
  "expected_valid": true,
  "execution_time_ms": 1250,
  "trace_id": "opik-trace-abc"
}

Opik Integration

Dataset Format

Validation records export to Opik-compatible format for prompt optimization:

{
  "input": {
    "title": "...",
    "body": "...",
    "category": "economie",
    "constat_factuel": "...",
    "idees_ameliorations": "..."
  },
  "expected_output": {
    "is_valid": true,
    "violations": [],
    "encouraged_aspects": ["..."],
    "confidence": 0.92,
    "reasoning": "...",
    "category": "economie"
  },
  "metadata": {
    "id": "...",
    "date": "2026-01-26",
    "source": "framaforms",
    "expected_valid": true,
    "violations_injected": []
  }
}
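The reshaping from the stored validation record (see Data Structure above) into this item format is a plain dict transform. A sketch with keys taken from the two formats shown (helper name hypothetical):

```python
def to_opik_item(record: dict) -> dict:
    """Reshape a stored validation record into the Opik dataset item shape."""
    return {
        "input": {
            "title": record["title"],
            "body": record["body"],
            "category": record["category"],
            "constat_factuel": record["constat_factuel"],
            "idees_ameliorations": record["idees_ameliorations"],
        },
        "expected_output": {
            "is_valid": record["is_valid"],
            "violations": record["violations"],
            "encouraged_aspects": record["encouraged_aspects"],
            "confidence": record["confidence"],
            "reasoning": record["reasoning"],
            "category": record["category"],
        },
        "metadata": {
            "id": record["id"],
            "date": record["date"],
            "source": record["source"],
            "expected_valid": record["expected_valid"],
            # Absent for real submissions; default to no injected violations
            "violations_injected": record.get("violations_injected", []),
        },
    }
```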

Optimization Workflow

1. Generate/load contributions

2. Run batch validation with Forseti

3. Store results in Redis

4. Export to Opik dataset

5. Create train/val/test split

6. Run daily Opik experiment (track metrics)

7. Run Opik optimizer

8. Update Forseti prompts

9. Repeat with new test set

Daily Experiments

The mockup processor supports Opik experiments for tracking Forseti's performance over time:

from app.processors import MockupProcessor

processor = MockupProcessor()

# Run daily experiment
result = await processor.run_daily_experiment(
    validate_func=forseti.validate,
    source_filter=["framaforms", "mock"],
)

print(f"Accuracy: {result.charter_accuracy:.1%}")
print(f"F1 Score: {result.f1_score:.2f}")
print(f"False Negatives: {result.false_negatives}")  # Missed violations!

Custom Metrics

Three charter-specific metrics are tracked:

| Metric | Description | Goal |
|---|---|---|
| charter_accuracy | Match between Forseti result and expected | > 95% |
| violation_detection | Detecting injected violations | > 98% |
| confidence_calibration | High confidence = correct prediction | > 0.8 |

Confusion Matrix

For charter validation, we track:

  • True Positive (TP): Invalid contribution correctly rejected ✅
  • True Negative (TN): Valid contribution correctly accepted ✅
  • False Positive (FP): Valid contribution incorrectly rejected ⚠️
  • False Negative (FN): Invalid contribution incorrectly accepted ❌ (worst!)

Key goal: Minimize False Negatives - a missed charter violation reaching the platform is worse than incorrectly flagging a valid contribution.

Supported Optimizers

  • FewShotBayesianOptimizer - Selects best examples for few-shot prompts
  • MetaPromptOptimizer - Uses LLM to generate/refine prompts
  • MiproOptimizer - DSPy-based optimization
  • EvolutionaryOptimizer - Genetic algorithm for prompt evolution

Usage

Streamlit UI

Access via the Mockup tab (?tab=mockup):

  1. Load Existing - Load contributions from Redis (fallback: JSON file)
  2. Generate Variations - Create Levenshtein mutations
  3. Single Contribution - Test one contribution manually
  4. Field Input - Generate from reports/docs
  5. Storage & Opik - View statistics, export datasets

Auto-Contribution Tab - Contribution Assistant

The Auto-Contribution tab (?tab=autocontrib) provides a user-friendly 5-step workflow for citizens to create charter-compliant contributions:

📚 Source Selection (Audierne docs or paste text)
  → 🏷️ Category (7 categories)
  → ✨ AI Draft (LLM generates draft)
  → ✏️ Edit (user edits fields)
  → 🔍 Validate & Save (Forseti 461 validates)
  → Store in Redis with validation results

5-Step Workflow:

| Step | Name | Description |
|---|---|---|
| 1 | Source | Select inspiration from Audierne2026 docs or paste custom text |
| 2 | Category | Choose one of 7 categories (economie, logement, culture, etc.) |
| 3 | Inspiration | AI generates draft constat_factuel + idees_ameliorations |
| 4 | Edit | User modifies both fields in editable text areas |
| 5 | Save | Forseti 461 validates, then saves to Redis with results |

Key Features:

  • Bilingual Support - UI and AI drafts in French or English
  • AI-Assisted Drafting - LLM generates contextual draft based on source document
  • Pre-Save Validation - Forseti 461 validates before storing
  • Full Traceability - Validation results (is_valid, violations, confidence) stored with contribution

Storage Format:

Contributions from the Questions tab are stored with source: "input" to distinguish them from mockup-generated data:

{
  "id": "input_abc123def456",
  "source": "input",
  "category": "economie",
  "constat_factuel": "Le parking du port est souvent saturé...",
  "idees_ameliorations": "Créer un parking relais à l'entrée...",
  "is_valid": true,
  "violations": [],
  "encouraged_aspects": ["Proposition concrète", "Ancrage local"],
  "confidence": 0.92,
  "reasoning": "Contribution constructive et locale...",
  "provider": "gemini",
  "model": "gemini-2.5-flash"
}

Integration with Mockup Tab:

Contributions created via the Auto-Contribution tab appear in the Mockup tab's "Load Existing" view:

  1. Mockup tab first loads from Redis storage
  2. Falls back to JSON file if Redis is empty
  3. Filter by source: "input" to see user-created contributions
  4. Run batch validation or export to Opik datasets

Field Input Workflow

Generate themed contributions from real municipal data:

📋 Field Input (90k+ chars)
  → 📦 Chunking (15k chunks, 500 overlap)
  → 🔍 Theme Extraction (per chunk, deduplicate)
  → 🏷️ Category (7 categories)
  → 📝 Contribution Generation (valid + violations)
  → Store in Redis/JSON
  → Run Opik Experiment

Document Chunking

Large documents (like public hearing transcripts, council minutes, etc.) are automatically chunked for LLM processing:

| Parameter | Value | Purpose |
|---|---|---|
| CHUNK_SIZE | 15,000 chars | Max text per LLM call |
| CHUNK_OVERLAP | 500 chars | Context continuity between chunks |
| Max input | ~100k+ chars | Handles full council minutes |

Processing Flow:

  1. Input text split into ~15k character chunks (preserving word boundaries)
  2. Each chunk sent to LLM for theme extraction
  3. Themes deduplicated across all chunks (same theme from different chunks merged)
  4. Contributions generated from unique themes
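The chunking step above can be sketched as follows (word-boundary handling simplified; the real implementation lives in `app/mockup/field_input.py` and may differ):

```python
def chunk_text(text: str, chunk_size: int = 15_000, overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks, breaking at the last space
    before the size limit so no word is cut in half."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to a word boundary inside the current window
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # step back for context overlap
    return chunks
```

The overlap means the tail of each chunk reappears at the head of the next, so a theme straddling a chunk border is still seen whole by at least one LLM call.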

Available Input Sources:

  • Audierne2026 Docs - Select from docs/docs/audierne2026/ markdown files
  • Paste Text - Directly paste content from reports, speeches, etc.
  • Upload File - Upload markdown or text files

Example:

from app.mockup import process_field_input_sync

# Process a public hearing report
result = process_field_input_sync(
    input_text=open("rapport_audience.md").read(),
    source_title="Audience publique - Rénovation École",
    contributions_per_theme=2,
    include_violations=True,
)

print(f"Extracted {result.themes_extracted} themes")
print(f"Generated {result.contributions_generated} contributions")
print(f"Categories: {result.categories_covered}")

Daily Experiment Workflow:

  1. Inject field data through UI (report, speech, etc.)
  2. LLM extracts themes across 7 categories
  3. Generate contributions (valid + violations)
  4. Save to Redis
  5. Run Opik experiment with current Forseti prompt
  6. Track accuracy over time

Programmatic API

from app.mockup import (
    load_contributions,
    generate_variations,
    get_storage,
    get_dataset_manager,
)

# Load base contributions
generator = load_contributions()

# Generate variations with violations
variations = generate_variations(
    constat_factuel="Le port est souvent saturé...",
    idees_ameliorations="Créer un parking relais...",
    category="economie",
    num_variations=5,
    include_violations=True,
)

# After validation, export to Opik
manager = get_dataset_manager()
manager.create_charter_dataset("forseti-charter-v2")
manager.add_from_redis(date_str="2026-01-26")
manager.sync_to_opik()

Metrics

Accuracy Metrics

| Metric | Formula | Target |
|---|---|---|
| Precision | TP / (TP + FP) | > 95% |
| Recall | TP / (TP + FN) | > 98% |
| F1 Score | 2 × (P × R) / (P + R) | > 96% |

Where:

  • TP = Invalid contribution correctly rejected
  • FP = Valid contribution incorrectly rejected
  • FN = Invalid contribution incorrectly accepted (most dangerous)
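These metrics follow directly from the definitions above, computed over validation records that carry both Forseti's verdict (`is_valid`) and the ground truth (`expected_valid`). A sketch (helper name hypothetical; "positive" means rejected as invalid):

```python
def charter_metrics(records: list[dict]) -> dict:
    """Precision/recall/F1 for charter validation, where a 'positive'
    is a contribution rejected as invalid."""
    tp = sum(1 for r in records if not r["is_valid"] and r["expected_valid"] is False)
    fp = sum(1 for r in records if not r["is_valid"] and r["expected_valid"] is True)
    fn = sum(1 for r in records if r["is_valid"] and r["expected_valid"] is False)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_negatives": fn}
```

Surfacing `false_negatives` as its own count, not just folded into recall, matches the key goal of catching every missed violation.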

Key Goal

Minimize False Negatives: A missed charter violation reaching the platform is worse than incorrectly flagging a valid contribution (which can be manually reviewed).

Logging & Debugging

The mockup system uses dedicated logging for debugging and monitoring:

Log Files:

  • logs/mockup.log - All mockup operations (DEBUG level)
  • logs/mockup_errors.log - Error-only log

Using MockupLogger:

from app.services.logging import MockupLogger

logger = MockupLogger("field_input_generator")

# Structured logging with kwargs
logger.info("PROCESS_START", source="direct_input", length=90034)
logger.info("CHUNKING", total_length=90034, chunks=7)
logger.debug("LLM_RESPONSE", chunk=0, length=424)
logger.info("THEMES_EXTRACTED", total=15, unique=12)

Common Log Events:

| Event | Level | Description |
|---|---|---|
| PROCESS_START | INFO | Field input processing started |
| CHUNKING | INFO | Document split into chunks |
| LLM_RESPONSE | DEBUG | Raw LLM response per chunk |
| CHUNK_THEMES | INFO | Themes found in each chunk |
| THEMES_EXTRACTED | INFO | Final unique themes after dedup |
| CONTRIBUTIONS_SAVED | INFO | Contributions stored to Redis/JSON |
| PROCESS_COMPLETE | INFO | Full processing summary |

See Logging System for full documentation.

Best Practices

  1. Diverse test sets - Include all 7 categories and violation types
  2. Progressive difficulty - Test from obvious to subtle violations
  3. Real data baseline - Include actual Framaforms submissions when available
  4. Regular re-testing - Run tests after any prompt changes
  5. Track over time - Compare accuracy across prompt versions

File Structure

app/mockup/
├── __init__.py # Module exports
├── generator.py # MockContribution, ContributionGenerator
├── levenshtein.py # Distance calculations, text mutations
├── llm_mutations.py # LLM-based semantic mutations (Ollama)
├── field_input.py # Field input processing (reports → contributions)
├── storage.py # Redis storage, ValidationRecord
├── dataset.py # Opik dataset management
├── batch_view.py # Streamlit UI (5 modes)
└── data/
    ├── contributions.json # Generated test contributions (fallback)
    └── category_themes.json # Category themes for field input

app/auto_contribution/
├── __init__.py # Module exports
└── views.py # 5-step Streamlit workflow UI

app/processors/workflows/
├── __init__.py # Workflow exports
└── workflow_autocontribution.py # Business logic (ContributionAssistant, 5 steps)

app/translations/
├── fr.json # French translations (autocontrib_* keys)
└── en.json # English translations (autocontrib_* keys)