Hackathon kickoff meeting
Date & Time: 2026-01-13 19:04:06 | Location: Online presentation
AI Agent Hackathon, New Year's Resolutions, AI Evaluation
Theme
This lecture introduces the "Commit to Change AI Agent Hackathon," a four-week online event challenging participants to build AI agents that help users stick to their New Year's resolutions. The event offers up to $30,000 in prizes across five tracks (Productivity, Personal Growth, Social Impact, Health/Wellness, and Financial Health) plus an overall 'Best Use of OPIC' challenge. It emphasizes the importance of AI evaluations, defining them as structured measurements of system behavior. The lecture details the hackathon's timeline, submission requirements on the ENCODE platform, and the mandatory use of the OPIC tool for evaluation, guiding participants from ideation to final project submission.
Takeaways
- Introduction to the Commit to Change AI Agent Hackathon.
- The hackathon's goal is to build AI that turns New Year's resolutions into real results.
- Hackathon registration on the ENCODE platform is required for all participants, including those joining from Luma.
- The hackathon is sponsored by Comet, with supporting partners Basel and Google DeepMind.
- This is a four-week online hackathon with up to $30,000 in prizes.
- Statistics on New Year's resolutions: 23% give up within 13 days, and 43% give up by the end of January.
- Hackathon advice: Start with a problem you have faced and build a unique solution.
- The hackathon features five thematic tracks: Productivity and Work Habits, Personal Growth and Learning, Social and Community Impact, Health, Fitness and Wellness, and Financial Health.
- It's recommended to focus on one category to create a high-quality, specialized app.
- There is an overall 'Best Use of OPIC' challenge for projects showcasing excellent evaluation and observability.
Highlights
"AI evals are what turn that sort of really cool, fun prototype into a system that you can actually iterate on with confidence and start to think about exposing to the real world and to real people."-- Abby"What's even cooler than a really cool prototype is, in my opinion, a cool tool that you can actually use in the real world on real data. And for that, you need some sort of system of evaluations."-- Abby"Evaluation turns guesswork into science."-- Abby"AI evaluations are how we turn random iteration into progress, and they give us a repeatable way to measure behavior, compare changes, and improve systematically instead of guessing."-- Abby
Chapters & Topics
Hackathon Overview and Goal
The Commit to Change AI Agent Hackathon is a four-week online event starting in January 2026, aimed at building AI agents and LLM-powered apps that help people stick to their New Year's resolutions and goals. It offers up to $30,000 in prizes.
- Keypoints
- The hackathon runs for four weeks online.
- The goal is to build AI/LLM apps to help users stick to their goals.
- There's a prize pool of up to $30,000.
- It is sponsored by Comet, with support from Basel and Google DeepMind.
- Participants must register on the ENCODE platform to get all updates and access resources.
- Explanation The hackathon is structured to guide participants from ideation to final submission. It includes workshops, deadlines to ensure progress, and resources provided by sponsors like Comet and partners like Google DeepMind. The central theme is leveraging AI to address the common problem of people giving up on their resolutions. Statistics show 23% of people give up 13 days into January, and 43% give up by the end of the month, highlighting the target audience for the projects.
Hackathon Tracks and Challenges
The hackathon is structured around five thematic tracks for projects, plus an overall challenge. The tracks are: Productivity and Work Habits, Personal Growth and Learning, Social and Community Impact, Health, Fitness and Wellness, and Financial Health. The overall challenge is for the 'Best Use of OPIC', rewarding projects with excellent evaluation and observability.
- Keypoints
- Productivity and Work Habits: Build tools for smarter work and better routines.
- Personal Growth and Learning: Build apps for learning new skills or developing self-awareness.
- Social and Community Impact: Build tools for organizing communities or supporting environmental/social action.
- Health, Fitness and Wellness: Build solutions for fitness, mental health, or general well-being.
- Financial Health: Build AI/LLM apps for budgeting, saving, or understanding money.
- Best Use of OPIC: A special category for projects with excellent evaluation and observability.
- Explanation Participants should choose one of the five themes to focus their project on. The themes are broad to accommodate a wide range of ideas. The 'Health, Fitness and Wellness' category is noted as a particularly common area for New Year's resolutions. In addition to the thematic challenges, all projects are eligible for the 'Best Use of OPIC' challenge, sponsored by Comet. This special prize can be won in conjunction with a thematic prize.
- Considerations
- Focus on one category rather than trying to build an app that spans multiple, as an app that does one thing well is better than one that does four things less well.
- Start with a problem that you yourself have faced to understand it well.
- Come up with a solution that sets you apart from everyone else; build something interesting and unique.
- Don't just do the bare minimum base idea; build on top of it.
- The 'Best Use of OPIC' category can be won alongside a theme prize; this is the only possible two-in-one win.
Hackathon Timeline and Submission Requirements
The hackathon has a structured timeline with key deadlines and specific submission requirements for the final project. The timeline spans four weeks, starting with ideation, moving to building, and culminating in a final submission.
- Keypoints
- Week 1: Ideation and Project Creation Deadline.
- Week 2: Building and a workshop on Gemini 3.
- Week 2/3: Mid-hackathon Deadline (submit GitHub repo and description).
- Final Submission Deadline: February 8th, 23:59 UTC-12.
- Required final submission items: Video pitch with demo and public code base.
- Recommended submission items: Hosted site and a presentation.
- The mid-hackathon deadline forces early and regular code commits.
- Explanation The timeline is designed to keep participants on track.
- Week 1: Launch, ideation, and a 'Project Creation Deadline' to commit to an idea and team.
- Week 2: Start building, with a workshop from Google DeepMind on Gemini 3.
- End of Week 2/Start of Week 3: 'Mid-hackathon Deadline' requiring submission of a GitHub repo and project description. This is not judged but ensures progress.
- Final Submission Deadline: February 8th at one minute to midnight, UTC-12. This means you can submit if it's still before midnight anywhere in the world. Final submissions must include a video pitch with a demo, a public codebase, and optionally a hosted site and presentation slides. These requirements are designed to give judges a comprehensive view of the project.
- Considerations
- Do not leave commits until the last minute; make regular commits from the start.
- The video pitch is the first impression for judges and needs to capture their attention immediately.
- While AI can be used for presentations, it is not recommended to use AI to generate the entire video pitch as it can be 'soul destroying' for judges.
- A hosted site and presentation are optional but highly recommended to strengthen your submission.
- Do not submit at the last minute, to avoid technical issues or mistakes.
Using the ENCODE Platform
The ENCODE platform is the central hub for the hackathon. It contains all necessary information, resources, and submission portals for participants.
- Keypoints
- The ENCODE platform is the home for the hackathon.
- Participants must register on the platform to get all emails and updates.
- To create a project and team, use the 'create project' and 'join team code' features.
- The platform has a 'Lecture' section with helpful videos.
- The 'Events' page lists all workshop links.
- Challenge descriptions and partner resources (Comet docs, Gemini credits, etc.) are available on the platform.
- All submissions are made through the participant's personal hackathon page on the platform.
- Explanation Participants will use the ENCODE platform for all hackathon-related activities. After the introductory session, participants who joined from Luma must register for the hackathon on the platform to receive emails and updates. The platform is where you create your project, add team members using a 'join code', find lecture videos, access event links for workshops, read detailed challenge descriptions, and find partner resources like the Comet developer documentation and OPIC quick start guide. All submissions, from the mid-hackathon check-in to the final project, will be done through this platform.
Definition of an AI Evaluation
An evaluation is a structured measurement of a system's behavior against criteria we care about. It involves defining what success and failure look like for the application and measuring behavior on representative tasks, in order to support decisions and improve product development.
- Keypoints
- It's a structured measurement of a system's behavior.
- It's measured against predefined criteria that define success.
- The hardest part is often figuring out the criteria you care about.
- It's important to think about success, failure, and potential failure modes from the beginning.
- Failure modes can vary, from harmful content and incorrect outputs to having the wrong tone for a brand.
- An evaluation system is often a set of many individual metrics.
- These individual metrics are aggregated into an overall success score.
- Explanation To unpack the definition: 'Structured' means repeatable and explicit, allowing for comparisons over time. 'Behavior' refers to what the system does, such as responses, tool calls, latency, or cost. 'Criteria' is how success is predefined, including accuracy, helpfulness, safety, or product-specific outcomes. An evaluation system is typically a set of many metrics aggregated into an overall success score, as sketched below.
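To make the "many metrics aggregated into one score" idea concrete, here is a minimal Python sketch. The criteria, scoring functions, weights, and example strings are illustrative placeholders, not part of the hackathon materials.

```python
# Minimal sketch: several per-criterion metrics rolled up into one success score.
# The criteria, scoring functions, weights, and example strings are illustrative.

def score_accuracy(output: str, expected: str) -> float:
    """Crude exact-match check; real evals usually need semantic scoring."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def score_conciseness(output: str, max_words: int = 50) -> float:
    """Stand-in for a 'conciseness' criterion: penalize overly long answers."""
    return 1.0 if len(output.split()) <= max_words else 0.0

def overall_score(output: str, expected: str, weights=(0.7, 0.3)) -> float:
    """Aggregate the individual metrics into a single weighted success score."""
    metrics = [score_accuracy(output, expected), score_conciseness(output)]
    return sum(w * m for w, m in zip(weights, metrics))

if __name__ == "__main__":
    print(overall_score("Drink 2L of water per day.", "Drink two liters of water daily."))
```

In practice the aggregation could be a weighted average, per-metric pass/fail thresholds, or a voting scheme across several judges.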
The Four Ingredients of a Good Evaluation
A good evaluation generally consists of four main ingredients: a target, a test set/task, a scoring method, and a decision rule. These components work together to form a repeatable and useful evaluation process.
- Keypoints
- A target: The specific capability or outcome being tested (e.g., factual QA, customer support resolution).
- A test set or task: Examples representing the real world, including edge cases and outliers.
- A scoring method: A metric, rubric, or judge (human or LLM) to score performance, including how individual scores are aggregated (e.g., math equation, voting system).
- A decision rule: Determines what to do with the results, such as shipping a feature, rolling it back, or retraining. It often includes a threshold (e.g., 'if the success rate improves by at least 20%, then ship it').
- Explanation First, the 'target' specifies the capability or outcome being tested (e.g., factual QA, tool use). Second, the 'test set' includes real-world examples and edge cases. Third, the 'scoring method' defines how performance is measured (e.g., a metric, rubric, LLM judge) and how scores are aggregated. Finally, the 'decision rule' dictates the action to be taken based on the evaluation score, such as shipping a feature or rolling it back, often based on a threshold or comparison to a baseline.
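As a rough sketch of how the four ingredients fit together, the snippet below wires up a tiny test set, a placeholder system under test, a simple rubric-style scorer, and a threshold-based decision rule. The test cases, the fake generate() function, and the 20% improvement threshold are illustrative assumptions, not prescribed by the hackathon.

```python
# Sketch of the four ingredients: target, test set, scoring method, decision rule.
# The target here is 'goal summarization'; all names and values are illustrative.

TEST_SET = [  # test set/task: examples standing in for real usage, incl. an edge case
    {"prompt": "Summarize: I want to run 5k three times a week.", "must_mention": "5k"},
    {"prompt": "Summarize: ", "must_mention": "goal"},  # edge case: empty goal
]

def generate(prompt: str) -> str:
    """Placeholder for the system under test (the 'target' capability)."""
    text = prompt.replace("Summarize: ", "").strip()
    return text if text else "No goal was provided."

def score(output: str, case: dict) -> float:
    """Scoring method: a simple rubric check; could be an LLM judge instead."""
    return 1.0 if case["must_mention"].lower() in output.lower() else 0.0

def success_rate() -> float:
    """Aggregate per-example scores into one number."""
    scores = [score(generate(case["prompt"]), case) for case in TEST_SET]
    return sum(scores) / len(scores)

def decide(new_rate: float, baseline_rate: float) -> str:
    """Decision rule: ship only if the success rate improves by at least 20%."""
    return "ship" if new_rate >= baseline_rate * 1.2 else "roll back"

if __name__ == "__main__":
    rate = success_rate()
    print(rate, decide(rate, baseline_rate=0.5))
```

A real eval would replace generate() with the actual application and score() with a metric, rubric, or LLM judge.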
Difference Between Benchmarks and Application Evals
Benchmarks are standardized test sets, often from academia, used for broad comparisons across different models. Application evaluations are specific to a product, matching its real distribution of prompts, workflows, and constraints, and are used to determine if a system is good enough to ship.
- Keypoints
- Benchmarks are standardized tests for comparing models.
- Application evals are product-specific tests for system performance.
- Benchmarks assess general ability; application evals assess behavior in a specific context.
- Benchmarks score the model in isolation; application evals test the entire system (prompts, RAG, tools).
- The biggest gap is distribution: benchmarks rarely match real traffic, edge cases, or domain language.
- A model can score high on a benchmark but fail in your specific application context.
- Benchmarks help narrow model choices; product evals tell you if the system is ready to ship.
- Explanation Benchmarks are useful for getting a quick read on a model's general capabilities and comparing model families (e.g., comparing one foundation model to another). However, a model can perform well on a benchmark and still fail in production because production success depends on behavior in a specific context. Application evals test the entire system—including prompts, RAG, and tool use—against your specific definition of success. The main gap is that benchmarks rarely match the distribution of real-world traffic, including edge cases and domain-specific language.
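To illustrate the distribution gap, here is a small hypothetical application eval for a resolution-tracking assistant: the prompts mimic real user phrasing (including an edge case), and the check covers system behavior (tool choice) rather than the model in isolation. The prompts, tool names, and route() function are invented for illustration.

```python
# Hypothetical application eval: product-specific prompts and expected behavior.
# route() is a stand-in for the real prompt/RAG/tool-calling pipeline.

APP_TEST_SET = [
    {"prompt": "I skipped the gym again, reschedule my week", "expected_tool": "calendar"},
    {"prompt": "log 2 glasses water", "expected_tool": "habit_tracker"},     # terse, real-user style
    {"prompt": "cancel all my goals!!!", "expected_tool": "confirm_first"},  # risky edge case
]

def route(prompt: str) -> str:
    """Placeholder for the deployed system's routing behavior."""
    if "reschedule" in prompt:
        return "calendar"
    if prompt.startswith("log"):
        return "habit_tracker"
    return "confirm_first"

def pass_rate() -> float:
    hits = [route(case["prompt"]) == case["expected_tool"] for case in APP_TEST_SET]
    return sum(hits) / len(hits)

if __name__ == "__main__":
    # A model can top a public benchmark and still fail product-specific cases like these.
    print(f"application eval pass rate: {pass_rate():.0%}")
```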
Challenges in Evaluating LLMs
Evaluating LLMs and agentic systems is inherently difficult due to several unique challenges that distinguish them from traditional software. These challenges include non-determinism, the subjectivity of tasks, high sensitivity to inputs, and the possibility of silent failures.
- Keypoints
- LLMs are non-deterministic: The same input does not guarantee the same output, making traditional monitoring difficult.
- Many AI tasks lack a single correct answer: Tasks like summarization or content generation can have multiple valid outputs, making simple match comparisons insufficient.
- Heuristic methods are largely ineffective: Simple heuristics like regex or pattern matching don't consider semantic meaning.
- Evaluation metrics are subjective: Concepts like relevance, coherence, and conciseness are open to interpretation.
- Human feedback is imperfect: It is expensive, can be inconsistent, and is difficult to scale.
- LLMs are extremely sensitive to prompts and context: Small changes can drastically alter the output.
- LLMs have silent failures: A system can produce a correct output but use flawed or unsafe reasoning to get there, which is a failure that isn't immediately obvious.
- There's no standard set of evaluation metrics applicable to all products.
- Explanation LLMs are non-deterministic, meaning the same input can produce different outputs, making hard-coded logic and simple error handling ineffective. Many tasks like summarization have no single correct answer, so simple matching (like regex) fails; semantic meaning must be considered. Metrics like 'relevance' are subjective, and even human annotators may disagree. LLMs are also very sensitive to small changes in prompts or context. Finally, they can have 'silent failures,' where the final output appears correct, but the underlying reasoning was flawed or discriminatory, which can only be caught by observing the entire process.
- Examples
Traditional software uses constructs like try/except clauses to handle anticipated errors. With LLMs, however, we cannot anticipate every possible output they might produce due to their non-deterministic nature, so a simple except clause is not a scalable or effective way to handle LLM failures.
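The snippet below sketches that gap: a try/except block and a regex check handle only the anticipated, deterministic case, while a semantically equivalent paraphrase is flagged as a failure. The strings and the placeholder judge are invented for illustration; a real setup would plug in a human review step or an LLM-as-judge metric.

```python
# Why deterministic checks fall short for LLM outputs: the exact/regex check
# below rejects a valid paraphrase, and try/except only catches errors we
# anticipated. Strings and the placeholder judge are illustrative.
import re

EXPECTED = "Drink two liters of water every day."
LLM_OUTPUT = "Aim for roughly 2L of water daily."  # semantically fine, worded differently

def heuristic_check(output: str) -> bool:
    """Traditional heuristic: exact/pattern matching ignores semantic equivalence."""
    return output == EXPECTED or bool(re.search(r"two liters", output, re.IGNORECASE))

def semantic_judge(output: str, reference: str) -> float:
    """Placeholder for a human or LLM judge that scores meaning, not surface form."""
    raise NotImplementedError("swap in embedding similarity or an LLM-as-judge metric")

if __name__ == "__main__":
    try:
        assert heuristic_check(LLM_OUTPUT), "heuristic rejected a valid paraphrase"
    except AssertionError as err:
        # The anticipated failure is caught, but we still don't know whether the
        # output was actually wrong or merely phrased differently.
        print(err)
```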
Challenges in Monitoring AI Agents
Monitoring AI agents is significantly more complex than monitoring standalone Large Language Models (LLMs) because agents are composed of multiple LLMs and external tools, leading to compounded variability and numerous failure modes.
- Keypoints
- Agents are built on top of LLMs, inheriting their non-deterministic nature.
- Chaining multiple LLM calls compounds variability and potential for error.
- Agents are multi-step workflows with more moving parts and thus more failure modes.
- The use of external tools introduces external dependencies, potential silent failures, and added unpredictability.
- Dynamic memory and context in agents can lead to intent drift and performance degradation over time.
- Evaluation of agents must include not just the final output, but also the reasoning, tool choice, and every step along the way.
- Explanation Agents are built on top of LLMs, which are non-deterministic systems. When you chain multiple LLM calls together, as is common in agents, the variability and potential for error at the beginning of the chain can be amplified through subsequent steps. Agents are also multi-step workflows with more moving parts, external tool dependencies (which can have silent failures), and dynamic memory/context. This complexity increases the number of failure modes and necessitates a more thorough evaluation process that goes beyond just the final output to include reasoning, tool choice, and each intermediate step. As agents are deployed in real-world scenarios with more complex workflows and higher usage frequency, tracking all these aspects becomes extremely difficult. A generic sketch of step-level trace logging is included after the example below.
- Examples
A user engaged with a dealership's chatbot and managed to get it to agree to sell them a Chevy Tahoe for one US dollar. The chatbot even stated 'no takesies backsies'. This agreement was legally upheld in court, forcing the dealership to sell the car for one dollar.
- This example illustrates a real-world consequence of an AI agent (chatbot) going wrong.
- The non-deterministic and unconstrained nature of the LLM powering the chatbot led to an unintended and costly outcome for the dealership.
- It highlights the critical need for robust monitoring and evaluation of agent behavior to prevent such incidents.
- The phrase 'no takesies backsies' being considered legally binding underscores how interactions with AI can have unforeseen legal ramifications.
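To make the "evaluate every step" point concrete, here is a minimal, generic Python sketch of recording each step of an agent run (reasoning, tool choice, tool result, final answer) so that intermediate failures are visible. This is not the OPIC API; the trace structure, tool name, and steps are invented for illustration.

```python
# Generic sketch of step-level tracing for an agent run, so that reasoning,
# tool choice, and tool results can all be evaluated, not just the final answer.
# NOT the OPIC API; the trace structure, tool name, and steps are invented.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class AgentTrace:
    steps: List[Dict[str, Any]] = field(default_factory=list)

    def log(self, kind: str, **data: Any) -> None:
        """Record one intermediate step (reasoning, tool call, tool result, answer)."""
        self.steps.append({"kind": kind, **data})

def run_agent(user_goal: str) -> Tuple[str, AgentTrace]:
    """Toy agent: one reasoning step, one tool call, one final answer."""
    trace = AgentTrace()
    trace.log("reasoning", text=f"User goal: {user_goal!r}; pick a tool to log it.")
    trace.log("tool_call", tool="habit_tracker", args={"goal": user_goal})
    trace.log("tool_result", tool="habit_tracker", result={"status": "created"})
    answer = f"Logged your goal: {user_goal}"
    trace.log("final_answer", text=answer)
    return answer, trace

if __name__ == "__main__":
    _, trace = run_agent("run three times a week")
    # An agent eval would score each recorded step, not only the final answer.
    for step in trace.steps:
        print(step)
```

In the hackathon itself, this kind of step-level information would be captured through OPIC's evaluation and observability features rather than a hand-rolled structure.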
Hackathon Logistics and Rules
The Commit to Change AI Agent Hackathon is an event where participants build AI or LLM-powered applications to help people maintain their New Year's resolutions. The event has specific rules regarding submissions, team formation, judging, and tooling.
- Keypoints
- Objective: Build an AI/LLM-powered app to help with New Year's resolutions.
- Team Formation: Solo participants or teams of any size are permitted (recommendation: 5 or fewer).
- Submissions: You can submit to multiple tracks but win at most one.
- Prizing: $5,000 for each of the five themes, plus a $5,000 special prize for the best use of OPIC.
- Required Tooling: OPIC must be used for evaluation.
- Allowed Tooling: Any LLM can be used (e.g., Gemini).
- Submission deliverable: A video demo is a mandatory and critical part of the submission, presenting the project and its functionality.
- Project Scope: Build an MVP. Full production-ready apps are not expected.
- Timeline: The hackathon has started, and coding can begin now. It lasts approximately 28 days.
- Prior Work: You can build on a pre-existing project, but only functionality added during the hackathon will be judged.
- Explanation Participants are tasked with creating an MVP (Minimum Viable Product) of a web or mobile app. They can work solo or in teams of any size, though teams of 5 or fewer are recommended. Submissions can target multiple prize categories, but a project can only win one. Judging criteria are available on the hackathon platform. A key component of the submission is a video demo that serves as both a product pitch and a functional walkthrough. While participants can use any LLM, they are required to use OPIC for evaluation. The hackathon provides access to partner services with generous free tiers instead of specific credits. Participants can start coding immediately and can even build upon existing projects, but only work done during the hackathon period will be judged.
- Special Circumstances
- If building a mobile app that cannot be easily shared or hosted for judging, a very thorough demo video showing all functionality is sufficient.
- If you want to work on a project you have already started, you can, but you will only be judged on the new functionality and features built during the hackathon period.
Assignments & Suggestions
- If you are joining from Luma, go into your programs page and register for the hackathon after the session.
- If you have questions that aren't answered live, put them in the Q&A to be answered in the Discord.
- Decide on the project you want to build, the theme to focus on, the solution to come up with, and what makes it special during the ideation stage in week one.
- By the project creation deadline at the end of week one, create and commit to the project idea, the challenge theme, and add team members.
- Start the building process in week two.
- For the mid-hackathon deadline, submit your publicly available GitHub repo and a fuller description of what you're building.
- For the final submission deadline on February 8th, submit a video pitch including a product demo, your public code base, a hosted site (optional but recommended), and a presentation (optional but recommended).
- Watch the helpful, short lecture videos on the platform to prepare for workshops and the wider hackathon.
- Run the code from the QR code provided to see how evaluations are created and what they look like in OPIC. The QR code leads to a simple recipe generator agent in a GitHub gist. Copy the script, follow the directions for creating online evaluations, create a free Comet account if you don't have one, and add your API key.