
Document Anonymization Pipeline

The anonymization pipeline provides PII protection for documents processed through the field input workflow. It auto-detects the document type and routes the text to the appropriate anonymizer.

For feature details (modes, entity types, Opik PII guardrail), see Forseti Anonymization Feature.

For the cost/accuracy/speed analysis, see the Anonymization Trilemma blog post.

Pipeline Integration

Anonymization runs automatically before theme extraction in the field input pipeline:

Input Text
    ↓
[Document Type Detection] (instant, regex-based)
├── Transcript with names? → Regex Anonymizer (free, instant)
├── Already anonymous? → Skip
└── General document? → LLM Anonymizer (accurate, costly)
    ↓
[PII Validation] ← Opik Guardrail (NLP check)
    ↓
[Theme Extraction] → Uses anonymized text
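The routing step in the diagram can be sketched as follows. This is a minimal illustration, not the code in app/mockup/anonymizer.py: the type names TRANSCRIPT_NAMED, TRANSCRIPT_ANONYMOUS, and GENERAL come from the test report below, while the regular expressions, the `detect_document_type` and `choose_anonymizer` function names, and the list of anonymous speaker labels are assumptions made for the example.

```python
import re
from enum import Enum


class DocumentType(Enum):
    TRANSCRIPT_NAMED = "transcript_named"
    TRANSCRIPT_ANONYMOUS = "transcript_anonymous"
    GENERAL = "general"


# Hypothetical pattern: a speaker line looks like "Label: utterance".
SPEAKER_LINE = re.compile(r"^\s*([^:\n]{1,40}):\s+\S", re.MULTILINE)
# Hypothetical list of labels that already count as anonymous.
ANONYMOUS_LABEL = re.compile(
    r"^(speaker|participant|interviewer|respondent)\s*\w*$", re.IGNORECASE
)


def detect_document_type(text: str) -> DocumentType:
    """Classify text with regexes only, so detection costs no LLM call."""
    labels = [m.group(1).strip() for m in SPEAKER_LINE.finditer(text)]
    if len(labels) < 2:
        # Empty input and plain prose both fall through to GENERAL.
        return DocumentType.GENERAL
    if all(ANONYMOUS_LABEL.match(label) for label in labels):
        return DocumentType.TRANSCRIPT_ANONYMOUS
    return DocumentType.TRANSCRIPT_NAMED


def choose_anonymizer(text: str) -> str:
    """Map the detected type to one of the three branches in the diagram."""
    routes = {
        DocumentType.TRANSCRIPT_NAMED: "regex",
        DocumentType.TRANSCRIPT_ANONYMOUS: "skip",
        DocumentType.GENERAL: "llm",
    }
    return routes[detect_document_type(text)]
```

Because the cheap regex check runs first, the costly LLM anonymizer is only reached when the text does not look like a transcript.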

Usage

from app.mockup.field_input import FieldInputGenerator, AnonymizationConfig

gen = FieldInputGenerator(provider="gemini")
# process_field_input is awaited, so call it from inside an async function.
result = await gen.process_field_input(
    input_text=text,
    anonymization_config=AnonymizationConfig(enabled=True, mode="auto"),
)

Configuration

from dataclasses import dataclass
from typing import Literal

@dataclass
class AnonymizationConfig:
    enabled: bool = True
    mode: Literal["auto", "transcript", "llm", "none"] = "auto"
    similarity_threshold: float = 0.85  # For fuzzy name matching
    store_mapping: bool = True  # Store entity mappings in result
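To show what `similarity_threshold` controls, here is a small sketch of fuzzy name matching. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for the Levenshtein dependency listed in the test report; the two libraries do not agree in general, although both give 0.83 for Karine/Carine. The helper names `name_similarity` and `same_person`, and the `>=` comparison against the threshold, are assumptions for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (difflib, not Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_person(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two spellings as one entity when the ratio reaches the threshold."""
    return name_similarity(a, b) >= threshold


print(round(name_similarity("Karine", "Carine"), 2))   # 0.83
print(same_person("Karine", "Carine"))                  # False at the default 0.85
print(same_person("Karine", "Carine", threshold=0.80))  # True with a looser threshold
```

Under these assumptions, the Karine/Carine pair falls just below the default of 0.85, so lowering the threshold merges more spelling variants at the cost of more false matches.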

Result Fields

  • anonymization_applied: bool
  • anonymization_type: "transcript" | "llm" | None
  • anonymization_mapping: dict mapping each original entity to its anonymized replacement
  • keywords_from_anonymization: list (LLM mode only)
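As an illustration of how a mapping like `anonymization_mapping` could be produced for a named transcript, the sketch below replaces speaker names with numbered labels. It is a simplified assumption, not the project's transcript anonymizer: the `anonymize_transcript` function, the speaker-line regex, and the "Speaker N" label format are all hypothetical, and fuzzy matching of spelling variants is left out.

```python
import re

# Hypothetical pattern: a speaker name is whatever precedes the first colon on a line.
SPEAKER_PREFIX = re.compile(r"^([^:\n]{1,40}):", re.MULTILINE)


def anonymize_transcript(text: str) -> tuple[str, dict[str, str]]:
    """Return the anonymized text and the original-to-anonymized mapping."""
    mapping: dict[str, str] = {}
    for match in SPEAKER_PREFIX.finditer(text):
        name = match.group(1).strip()
        if name not in mapping:
            mapping[name] = f"Speaker {len(mapping) + 1}"

    anonymized = text
    # Replace longer names first so a short name never clobbers part of a longer one.
    for name in sorted(mapping, key=len, reverse=True):
        anonymized = re.sub(rf"\b{re.escape(name)}\b", mapping[name], anonymized)
    return anonymized, mapping


text = "Karine: Hello Marc.\nMarc: Hi Karine."
anonymized, mapping = anonymize_transcript(text)

# Shape of the result fields listed above, filled in by hand for this example.
result_fields = {
    "anonymization_applied": bool(mapping),
    "anonymization_type": "transcript" if mapping else None,
    "anonymization_mapping": mapping,
}
print(anonymized)
print(result_fields["anonymization_mapping"])
```

The names are replaced both in the speaker prefixes and where they are mentioned inside utterances, which is why the mapping is built first and applied to the whole text afterwards.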

Frontend Integration

Three Forseti features are available in the Contributions tab:

| Button | Feature | Description |
|---|---|---|
| Verify charter | validate | Full charter validation |
| Classify | classify_category | Category classification |
| Anonymize | anonymization | PII anonymization |

Files

| File | Description |
|---|---|
| app/mockup/anonymizer.py | Transcript anonymizer, type detection, PII validation |
| app/agents/forseti/features/anonymization.py | LLM-based anonymization feature |
| app/mockup/field_input.py | Pipeline integration |
| app/prompts/local/forseti.py | Anonymization prompt template |

Pre-Release Test Report

Date: 2026-02-08
Branch: feature/apscheduler-plan-tasks
Status: Ready for merge to dev

Test Results

| Test | Status | Details |
|---|---|---|
| Core imports | PASS | anonymizer, field_input, features, models |
| Transcript detection | PASS | Correctly identifies TRANSCRIPT_NAMED |
| General doc detection | PASS | Correctly identifies GENERAL |
| Transcript anonymization | PASS | 2 speakers, names replaced |
| Empty string handling | PASS | Returns GENERAL type, no crash |
| Already anonymous detection | PASS | Detects TRANSCRIPT_ANONYMOUS |
| Levenshtein dependency | PASS | Ratio 0.83 for Karine/Carine |
| AnonymizationConfig defaults | PASS | enabled=True, mode=auto |
| FieldInputResult serialization | PASS | to_dict() includes anonymization fields |
| PII validation | PASS | Returns result (graceful degradation) |
| Translation keys (EN) | PASS | All 9 new keys present |
| Translation keys (FR) | PASS | All 9 new keys present |
| EN/FR key parity | PASS | No missing keys |
| front.py syntax | PASS | AST parse successful |
| Required functions | PASS | All 7 functions defined |
| Required imports | PASS | json, AnonymizationFeature |
| Docs build | PASS | Successfully generated |

Known Issues

| Issue | Severity | Mitigation |
|---|---|---|
| Opik PII Guardrail API errors | Low | Fails open with error logged |
| N8N webhook empty response | Low | JSONDecodeError now caught |
| Docs broken anchors (FR) | Low | Pre-existing, unrelated |
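The first two mitigations follow a common pattern, sketched below under stated assumptions. The function names `validate_fail_open` and `parse_webhook_body`, the logger name, and the shape of the returned dict are hypothetical; the `check` callable stands in for the real guardrail call, whose API is not shown here.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("anonymization")


def validate_fail_open(text: str, check: Callable[[str], bool]) -> dict:
    """Run a PII check; if the check itself errors, log it and let the text through."""
    try:
        return {"passed": check(text), "error": None}
    except Exception as exc:  # any guardrail/API failure, by assumption
        logger.error("PII guardrail unavailable, failing open: %s", exc)
        return {"passed": True, "error": str(exc)}


def parse_webhook_body(body: str) -> Optional[Any]:
    """Parse a webhook response, returning None for an empty or invalid body."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        logger.warning("Webhook returned a body that is not valid JSON")
        return None
```

Failing open keeps the pipeline running when the guardrail is down, but it also means unvalidated text can proceed to theme extraction, so the logged error is the only signal that the check was skipped.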

Verification Checklist

  • All imports resolve correctly
  • Transcript anonymization works
  • LLM anonymization works
  • Auto-detection routes correctly
  • Error handling graceful
  • Translations complete
  • Frontend buttons functional
  • Docs build successfully

Related: Forseti Agent | Field Input Workflow | Logging