The Þorn in the Þread
On sovereignty, identity, and the glyph that binds them
Today we completed a major architectural milestone: modular prompt management for Forseti461. Each feature now has its own versioned prompt in Opik, enabling independent optimization and A/B testing.
From a single monolithic prompt to a clean separation of concerns — each Forseti feature can now evolve independently while sharing a common persona.
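To make the separation concrete, here is a minimal sketch of the idea behind per-feature versioned prompts sharing one persona. This is plain Python illustrating the concept, not the Opik SDK's actual API; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    """One immutable version of a feature's prompt."""
    version: int
    template: str


@dataclass
class PromptRegistry:
    """Per-feature versioned prompts sharing a common persona prefix."""
    persona: str
    _store: dict = field(default_factory=dict)  # feature -> [PromptVersion, ...]

    def publish(self, feature: str, template: str) -> PromptVersion:
        """Append a new version for a feature; versions are never mutated."""
        versions = self._store.setdefault(feature, [])
        pv = PromptVersion(version=len(versions) + 1, template=template)
        versions.append(pv)
        return pv

    def latest(self, feature: str) -> str:
        """Render the newest prompt for a feature, persona included."""
        pv = self._store[feature][-1]
        return f"{self.persona}\n\n{pv.template}"


registry = PromptRegistry(persona="You are Forseti461, a civil and impartial moderator.")
registry.publish("toxicity", "Flag personal attacks and explain why.")
registry.publish("toxicity", "Flag personal attacks; cite the offending phrase.")
```

Because each feature owns its version history, one feature's prompt can be iterated or A/B-tested without touching the others, while the shared persona keeps the agent's voice consistent.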
Forseti461 is an AI agent that automatically moderates citizen contributions on participatory-democracy platforms: it approves only ideas that are concrete, constructive, and locally relevant; rejects personal attacks, spam, off-topic posts, and misinformation; and always explains its decisions with respectful, actionable feedback.
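A hypothetical shape for those decisions, useful as a mental model of what the agent returns (the field names here are illustrative assumptions, not the project's actual schema):

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ModerationDecision:
    """Illustrative output shape for a Forseti461 verdict."""
    verdict: Literal["approve", "reject"]
    category: str   # e.g. "personal_attack", "spam", "off_topic", "misinformation"
    feedback: str   # respectful, actionable explanation shown to the contributor


decision = ModerationDecision(
    verdict="reject",
    category="off_topic",
    feedback="This comment addresses national policy; please relate it to the local budget proposal.",
)
```

Pairing every rejection with a category and human-readable feedback is what keeps the moderation respectful and actionable rather than opaque.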
This weekend, Facebook reminded us how fragile democracy is. Toxic comments, personal attacks, and off-topic rants flooded discussions about local issues. The signal gets lost in the noise. Citizens disengage. Constructive voices give up.
What if we could protect civic debate at scale?
Goal: Create a systematic approach to test and improve our AI-powered charter validation system.
For the Encode Hackathon first submission, we focused on building the infrastructure to ensure Forseti461 (our charter validation agent) catches all violations reliably. The key insight: you can't improve what you can't measure.
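Since the goal is catching all violations, the headline metric is recall over a labeled test set. A minimal sketch of that measurement, assuming a simple parallel-list representation of labels and agent predictions:

```python
def violation_recall(labels, predictions):
    """Fraction of true violations the agent actually flagged.

    labels, predictions: parallel lists of booleans, True = violation.
    """
    caught = sum(1 for y, p in zip(labels, predictions) if y and p)
    total = sum(labels)
    return caught / total if total else 1.0


# Tiny labeled test set: 3 true violations, the agent catches 2 of them.
labels      = [True, True, True, False, False]
predictions = [True, False, True, False, True]
recall = violation_recall(labels, predictions)  # 2 caught / 3 violations
```

Tracking this number across prompt versions turns "did we get better?" into a measurable regression check rather than a gut feeling.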
This lecture, led by Abby Morgan, an AI Research Engineer, introduces AI evaluation as a systematic feedback loop for moving prototypes to production-ready systems. It outlines the four key components of a useful evaluation: a target capability, a test set, a scoring method, and decision rules. The session distinguishes general benchmarks from product-specific evaluations, emphasizing the need for observability in agent evaluation, and demonstrates Opik, an open-source tool for tracking, debugging, and evaluating LLM agents through traces, spans, LLM-as-a-judge scoring, and regression-testing datasets.
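The four components above can be sketched as one small loop. This is a conceptual illustration under stated assumptions: the scoring function stands in for a judge (a real evaluation might use an LLM-as-a-judge here), and the threshold is an arbitrary example of a decision rule.

```python
def evaluate(test_set, score_fn, threshold=0.9):
    """Run an evaluation: the capability lives in score_fn, the test set is
    explicit, scores are aggregated, and a decision rule gates shipping."""
    scores = [score_fn(item) for item in test_set]
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "ship": mean >= threshold}


# Stand-in judge: exact-match scoring of the agent's verdict against the
# expected verdict for each labeled example.
test_set = [
    {"pred": "reject",  "expected": "reject"},
    {"pred": "approve", "expected": "reject"},
]
result = evaluate(test_set, lambda ex: 1.0 if ex["pred"] == ex["expected"] else 0.0)
```

The decision rule is what separates an evaluation from a benchmark: the number only matters because it triggers a concrete action (ship, or keep iterating).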