The Þorn in the Þread
On sovereignty, identity, and the glyph that binds them
Articles on observability practices and tools
Today we completed a major architectural milestone: modular prompt management for Forseti461. Each feature now has its own versioned prompt in Opik, enabling independent optimization and A/B testing.
From a single monolithic prompt to a clean separation of concerns — each Forseti feature can now evolve independently while sharing a common persona.
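As a rough illustration of that separation, the sketch below keeps one versioned prompt history per feature and prepends a shared persona at render time. It is a minimal in-memory stand-in, not the Opik prompt library: the `PromptRegistry` class, the feature names, and the persona and template text are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical shared persona; the real Forseti461 persona text is not shown here.
PERSONA = "You are Forseti461, an impartial moderator of civic contributions."


@dataclass(frozen=True)
class FeaturePrompt:
    feature: str
    version: int
    template: str

    def render(self, **variables: str) -> str:
        # Every feature prompt shares the same persona prefix.
        return f"{PERSONA}\n\n{self.template.format(**variables)}"


class PromptRegistry:
    """In-memory stand-in for a prompt store that versions each feature separately."""

    def __init__(self) -> None:
        self._versions: dict[str, list[FeaturePrompt]] = {}

    def register(self, feature: str, template: str) -> FeaturePrompt:
        # Registering a template appends a new version for that feature only.
        history = self._versions.setdefault(feature, [])
        prompt = FeaturePrompt(feature, len(history) + 1, template)
        history.append(prompt)
        return prompt

    def get(self, feature: str, version: Optional[int] = None) -> FeaturePrompt:
        # Latest version by default; a pinned version supports A/B comparisons.
        history = self._versions[feature]
        return history[-1] if version is None else history[version - 1]


registry = PromptRegistry()
registry.register("charter_validation", "Check this contribution against the charter:\n{text}")
registry.register("charter_validation", "List any charter violations in:\n{text}")
registry.register("category_classification", "Assign a category to:\n{text}")
```

Pinning `registry.get("charter_validation", 1)` against the latest version is one simple way to run an A/B comparison on a single feature without touching the others.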
Forseti461 is an AI agent that automatically moderates citizen contributions to participatory democracy platforms — approving only concrete, constructive, locally relevant ideas while rejecting personal attacks, spam, off-topic posts, or misinformation, and always explaining decisions with respectful, actionable feedback.
This weekend, Facebook reminded us that democracy is fragile. Toxic comments, personal attacks, and off-topic rants flooded discussions about local issues. The signal gets lost in the noise. Citizens disengage. Constructive voices give up.
What if we could protect civic discourse at scale?
Goal: Create a systematic approach to test and improve our AI-powered charter validation system.
For our first Encode Hackathon submission, we focused on building the infrastructure to ensure that Forseti461 (our charter validation agent) catches all violations reliably. The key insight: you can't improve what you can't measure.
This lecture, led by Abby Morgan, an AI Research Engineer, introduces AI evaluation as a systematic feedback loop for moving prototypes toward production-ready systems. It outlines the four key components of a useful evaluation: a target capability, a test set, a scoring method, and decision rules. The session distinguishes general benchmarks from product-specific evaluations and emphasizes the need for observability in agent evaluation. It demonstrates using Opik, an open-source tool, to track, debug, and evaluate LLM agents through features like traces, spans, 'LLM as a judge', and regression testing datasets.
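The four components named above can be sketched in a few lines of plain Python. This is a toy example under stated assumptions, not material from the lecture or the Opik API: the test cases, the `keyword_validator` stand-in for the real agent, and the recall threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    text: str
    violates_charter: bool


# 1. Target capability: flag contributions that violate the charter.

# 2. Test set: a handful of labeled examples (hypothetical).
TEST_SET = [
    Case("Add a bike lane on Main Street near the school.", False),
    Case("The mayor is an idiot and so are his voters.", True),
    Case("BUY CHEAP WATCHES at example.invalid", True),
    Case("Extend library opening hours on Saturdays.", False),
]


def keyword_validator(text: str) -> bool:
    """Toy stand-in for the real agent: returns True when a violation is flagged."""
    lowered = text.lower()
    return any(marker in lowered for marker in ("idiot", "buy cheap"))


# 3. Scoring method: recall on true violations (how many did we catch?).
def violation_recall(validator: Callable[[str], bool], cases: Sequence[Case]) -> float:
    violations = [case for case in cases if case.violates_charter]
    if not violations:
        raise ValueError("test set contains no violations to score against")
    caught = sum(1 for case in violations if validator(case.text))
    return caught / len(violations)


# 4. Decision rule: only ship if every known violation is caught (assumed threshold).
RECALL_THRESHOLD = 1.0


def should_ship(recall: float) -> bool:
    return recall >= RECALL_THRESHOLD


recall = violation_recall(keyword_validator, TEST_SET)
print(f"violation recall: {recall:.2f}, ship: {should_ship(recall)}")
```

Recall alone ignores false positives, so a fuller evaluation would also score how often acceptable contributions are wrongly rejected.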