Was 2025 the year of Agents? That might be 2026.

For me, 2025 was the year of Evals, or more accurately - the year in Evals.

Evals are the second-best tool for measuring feature quality: the best way is to put the feature in the hands of actual users, but that’s riskier, which makes evals the best safe way to measure it.

This was the year I started working on a new GenAI project, and my role was to take charge of the evals process we had and do everything needed so that our feature’s quality improved.

I like to think of evals as focus studies for movies; that would make LLMs the actors and more importantly, I would be the director.

This post is a collection of my experiences and observations, and I hope it will eventually be useful as a skills/evals reference.

Problem Definition

You need to clearly, excessively clearly, define the problem your system intends to solve. The problem definition will influence the combination of methods you will choose, the measurement criteria, and how you will incorporate the feedback and the analyses.

For instance, my feature was a question-answering system for Google Workspace administrators. A system definition like “Answer user queries” would be inadequate; a clearer version like “Answer admin queries about Google Workspace administrative tasks through public and internal knowledge bases” helps form the quality criteria and thresholds and constrains the problem space, which leads to better optimization techniques. The characteristics of a problem definition include:

  1. User persona (“small-scale admins”)
  2. Target problems (“administrative tasks”)
  3. Available corpus (“public and internal knowledge bases”)
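One thing that helps is writing the definition down as a tiny config that the rest of the eval tooling can read. A minimal sketch, with a schema and field names of my own choosing rather than any prescribed format:

```python
from dataclasses import dataclass

@dataclass
class ProblemDefinition:
    """A problem definition concrete enough to drive eval design (hypothetical schema)."""
    user_persona: str           # who the feature serves
    target_problems: list[str]  # what kinds of tasks are in scope
    corpora: list[str]          # which knowledge bases responses may draw from

workspace_qa = ProblemDefinition(
    user_persona="small-scale Google Workspace admins",
    target_problems=["administrative tasks"],
    corpora=["public help center articles", "internal knowledge bases"],
)
```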

Quality Criteria

We have seen millions of benchmarks where LLMs have scored XYZ% (up from ABC%) but, in practice, they fail to be as useful as they claim to be, either by themselves or in comparison with other, lower-scoring models; i.e., they don’t have good “vibes”.

I strongly, strongly believe that the vibes of an LLM are a very important criterion, and I equally strongly believe that we can have quantitative-qualitative metrics which are a good approximation of a system’s usefulness (vibes are much harder to measure and iterate upon).

The kind of quality measurement criteria you come up with depends on your constraints. If you have a sufficient human eval budget and you can use human eval results as the final determiner of the feature’s quality, then you can adopt a granular (1-5, etc.) and complex (a lot of dimensions such as usefulness, truthfulness, etc.) assessment process. However, it is a non-trivial task to get the auto rater’s output to match the human raters on such a scheme (my team has spent a lot of time and effort improving the correlation between our human evals and auto evals) and, although I haven’t personally done this, it would be useful to go with simpler (win-loss-tie style) metrics (a nice writeup on that here) so that the auto evals can be more reliably trusted.
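To illustrate the trade-off, here is a rough sketch (plain Python, made-up scores) that measures human-auto agreement on a 1-5 side-by-side scale and again after collapsing the same pairs into win/loss/tie; the idea is that agreement is easier to reach on the coarser labels:

```python
# Sketch: comparing human and auto ratings on a granular (1-5) scale vs. a
# collapsed win/loss/tie scale. Data shapes and numbers are hypothetical.

def to_wlt(score_a: int, score_b: int) -> str:
    """Collapse a side-by-side 1-5 pair into win/loss/tie for system A."""
    if score_a > score_b:
        return "win"
    if score_a < score_b:
        return "loss"
    return "tie"

def agreement(xs, ys) -> float:
    """Fraction of items where the two raters gave the same label."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

# Each row: (human score for A, human score for B, auto score for A, auto score for B)
ratings = [(5, 3, 4, 3), (2, 4, 3, 4), (4, 4, 5, 3), (3, 3, 3, 3)]

human_wlt = [to_wlt(ha, hb) for ha, hb, _, _ in ratings]
auto_wlt = [to_wlt(aa, ab) for _, _, aa, ab in ratings]

print("1-5 agreement on A:", agreement([r[0] for r in ratings], [r[2] for r in ratings]))
print("win/loss/tie agreement:", agreement(human_wlt, auto_wlt))
```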

It’s important to make sure that you’re assessing the different parts of the system: if you’re building a feature which answers questions by doing web searches, it is crucial to capture the sources used for generating the response and assess their relevancy.

Models are good at SEO-style searches by default (research paper on models’ search usage efficacy) and are getting better at searching. Site restricts are good if you know your corpus. Multi-step searches solve most search relevancy problems, but they can lead to context bloat.
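As a concrete (if simplified) example of assessing source relevancy, here is a sketch that checks what fraction of a response’s cited URLs fall inside the intended corpus; the domain list and trace shape are assumptions for illustration:

```python
from urllib.parse import urlparse

# Hypothetical allowed corpus for a Workspace-admin assistant.
ALLOWED_DOMAINS = {"support.google.com", "workspace.google.com"}

def in_corpus(url: str) -> bool:
    """True if the cited URL belongs to one of the allowed domains."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def corpus_hit_rate(cited_urls: list[str]) -> float:
    """Fraction of cited sources that come from the intended corpus."""
    if not cited_urls:
        return 0.0
    return sum(in_corpus(u) for u in cited_urls) / len(cited_urls)

# Example: sources pulled out of a single response's trace.
print(corpus_hit_rate([
    "https://support.google.com/a/answer/33312",
    "https://example-seo-blog.com/workspace-tips",
]))  # -> 0.5
```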

Rubrics Rule

Rubric guidelines that are as clear as possible are necessary for measuring quality (regardless of human or auto evals). The rubric should be strict and unambiguous. That doesn’t mean the rubric will be perfect - it likely won’t be - but writing it will give you the familiarity needed to build one that works well. As you carry out evals (human or auto), you will see how the raters interpret the rubric and apply the rules for their rating - their interpretation might be different from what you had originally intended - and that’s why it is crucial to iterate on rubrics.

Hypothetical Rubric for Evaluating Responses

This rubric evaluates responses across three dimensions that roughly align with how usefulness is judged in practice: did the system follow instructions, are the claims grounded in the sources, and is the output actually usable overall. The intent is not to be exhaustive or perfect, but to be clear enough that different raters (human or automated) can apply it consistently and iterate on it over time.

  1. Completeness - How well does the response follow the instructions from the prompt based on the sources?
    1. Incomplete (1) - The response does not include relevant details from the sources needed to address the prompt. Important parts of the instructions are effectively ignored.
    2. Partially Complete (2) - The response includes some relevant information from the sources but misses key details that materially affect whether the prompt is satisfied.
    3. Complete (3) - The response incorporates most or all relevant parts of the sources and meaningfully follows the prompt’s instructions.
  2. Truthfulness – After spending a short amount of time verifying the claims, how accurate is the response relative to the sources?
    1. Major Issues (1) - The response contains claims that are not supported by the sources and are factually incorrect.
    2. Minor Issues (2) - The response is largely grounded in the sources, but some details are misinterpreted, overstated, or slightly inaccurate.
    3. No Issues (3) - The response’s claims are supported by the sources and accurately represented.
  3. Overall Impression – How good is this response overall?
    1. Horrible (1) - The response is catastrophically bad. It fails to address the prompt in any meaningful way, consists largely of hallucinated or misleading content, and provides no practical value.
    2. Pretty Bad (2) - The response is mostly incorrect or poorly grounded in the sources. It would only be usable by a domain expert who can independently identify and correct the issues.
    3. Okay (3) - The response is partially useful and directionally correct. It is based on the sources but contains errors or omissions that require the user to consult the original material.
    4. Pretty Good (4) - The response is solid and usable for most users. It meets the prompt’s requirements using the sources, with minor issues that do not meaningfully hinder progress.
    5. Amazing (5) - The response is clear, accurate, and satisfying. It leverages the sources well and contains, at most, negligible issues that do not affect the user’s ability to move forward.

Like any rubric, this one will fail in edge cases. The important part is not getting it “right” on the first try, but watching how raters apply it (human and automated), identifying where interpretations diverge, and tightening the definitions until the rubric becomes a reliable tool rather than a vague guideline.
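One practical habit is to keep the rubric as data and render both the human rater guidelines and the auto-rater prompt from it, so everyone literally reads the same wording. A minimal sketch, with a hypothetical schema and only two of the dimensions shown:

```python
# Sketch: keep the rubric as data so human guidelines and the auto-rater
# prompt are rendered from the same source. Schema is hypothetical.
RUBRIC = {
    "completeness": {
        1: "Incomplete: relevant details from the sources are missing; instructions ignored.",
        2: "Partially complete: some relevant information, but key details are missing.",
        3: "Complete: incorporates most or all relevant parts of the sources.",
    },
    "truthfulness": {
        1: "Major issues: claims unsupported by the sources and factually incorrect.",
        2: "Minor issues: largely grounded, but some details misinterpreted or overstated.",
        3: "No issues: claims supported by the sources and accurately represented.",
    },
}

def render_judge_prompt(prompt: str, response: str, sources: str) -> str:
    """Render an auto-rater prompt that quotes the rubric verbatim."""
    lines = ["Rate the response on each dimension using the scale below."]
    for dim, scale in RUBRIC.items():
        lines.append(f"\n{dim.capitalize()}:")
        for score, desc in sorted(scale.items()):
            lines.append(f"  {score}: {desc}")
    lines += ["\nPrompt:", prompt, "\nSources:", sources, "\nResponse:", response,
              "\nReturn one score per dimension as JSON."]
    return "\n".join(lines)
```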

Human Evals and Auto Evals

Human Evals

Human evals are the kind where human evaluators, generally people who are knowledgeable at judging the problem’s solution, rate the system responses; these evals are costly, high signal, and a goldmine for meta-evaluation of the eval process.

A few things about human evals:

  1. Quality raters lead to quality ratings: It is important to have the right set of raters with the right skills and background and the ability to quantify their qualitative assessments. If your ratings are misaligned, it will affect the iteration quality and impede progress on the feature. On the other hand, with a good set of raters, you can also collect descriptive observations about the issues (e.g.: What are the completeness issues in this response?) and these observations directly contribute to quality engineering efforts.
  2. Consistent raters are necessary for meaningful results: Consistency among raters is also important, and by consistency I mean the inter-rater delta, i.e. how much raters differ in their ratings for the same response.
    • If Rater R1 rates a response as 5/5 and Rater R2 rates the same response as 1/5, then it becomes very hard to make sense of that rating; averaging the ratings will just mask the issue in the rating process (this is a good example of why simpler rating scales can be beneficial: fewer options => less variance).
    • Although eliminating this delta entirely would lose some nuance in the raters’ evaluations, it is useful to have the raters calibrate themselves against the rubric; it will improve the rubrics and reduce the rating variance (a small agreement-measurement sketch follows this list).
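Here is the kind of agreement measurement I mean: raw agreement plus Cohen’s kappa between two raters on the same responses, in plain Python with made-up ratings:

```python
from collections import Counter

def cohens_kappa(r1: list[int], r2: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    labels = set(r1) | set(r2)
    expected = sum((c1[label] / n) * (c2[label] / n) for label in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

# Two raters scoring the same ten responses on the 1-5 overall scale.
rater_1 = [5, 4, 3, 4, 2, 5, 3, 4, 4, 1]
rater_2 = [4, 4, 3, 5, 2, 5, 2, 4, 3, 1]
print(round(cohens_kappa(rater_1, rater_2), 2))
```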

Auto Evals

Auto(matic) evals don’t involve humans and rely on some form of programmatic evaluation (an LLM call counts under this). They are cheap, fast, and continue to get better.

They come in a few flavors:

  1. Triggering AE - Has the LLM made the expected tool call? It’s simple to check for (just look for the expected tool call in your LLM’s traces) and it’s fast to iterate on. Although triggering has become less prominent now that LLMs have more leeway in planning and picking the right tool for the task, triggering evals are still highly useful if you’re writing planning instructions and you want the LLM to pick the right tool for a use case; you write the planning instructions, run the triggering eval, inspect the failures, and iterate on the instructions.
  2. Trajectory AE - Has the agent made the right set of tool calls? I haven’t found a good, textbook definition of trajectory evals (LangChain’s is a decent one), but it’s essentially triggering evals made agentic. For sufficiently complex problems, it doesn’t make sense to prescribe the plan to the LLM (e.g.: “Compile a real-time report on civic losses due to Bengaluru’s potholes” is not solvable by a single tool call) and we begin to trust the LLM to carry out an expected set of steps in its preferred order. For the example above, this would be multiple web search calls, a few code execution calls to create charts, etc. We evaluate whether these expected tool calls are present in the agent’s traces.
  3. Response AE - Has the system produced the right response? The earlier two AEs focus on whether the system is working in the right direction, while the response AE assesses whether the generated response is right. This is the most expensive AE to run - it requires both a full run of the system and at least one LLM call to judge the response. The reliability of response AE is highly dependent on the feature and the AE setup - you need to do a non-trivial amount of context engineering so that the LLM rates the response as close as possible to how a human rater would.

I go for triggering AE when I am fixing issues with a particular tool’s invocation, for trajectory AE when I want to measure whether tool chaining and sequencing is happening correctly, and for response AE when I have done a cursory check of the responses and I need a vibe check on all of the responses.
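For the first two flavors, the check itself is mostly set logic over the trace; the LLM judge only enters the picture with response AE. A minimal sketch of triggering and trajectory checks, assuming the trace has already been reduced to an ordered list of tool-call names (that reduction step is glossed over here):

```python
# Sketch: triggering and trajectory checks over an agent trace, assuming the
# trace is an ordered list of tool-call names extracted from the system's logs.

def triggering_pass(trace: list[str], expected_tool: str) -> bool:
    """Triggering AE: did the expected tool get called at all?"""
    return expected_tool in trace

def trajectory_pass(trace: list[str], expected_tools: set[str]) -> bool:
    """Trajectory AE (loose form): were all expected tool calls present,
    regardless of order?"""
    return expected_tools.issubset(trace)

trace = ["web_search", "web_search", "code_execution", "final_answer"]
print(triggering_pass(trace, "web_search"))                      # True
print(trajectory_pass(trace, {"web_search", "code_execution"}))  # True
```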

Eval Sets

  1. Eval Set: The eval set is the piece that arises out of the problem definition so that the quality criteria can be measured. Curating the eval set is an important and nuanced task: it will determine which aspects of the system come into play and what improvements you can make. Consider the following scenarios:
    1. Your eval set is too easy: You will be reporting good measured performance to the management chain, but it is highly likely that once your system is deployed to real users, they won’t find it very useful. On a side note, I do think that many LLM-based systems do this.
    2. Your eval set is too hard: This is good for the perfectionist in you, but it is also bad from a usability perspective: you don’t need to solve all of the users’ problems before you provide something of value to them. It is good to keep the hard queries, as they will help in assessing future improvements to the system, but keeping only them would mean you would never launch because the quality bar would be too high. The value proposition of your system is tricky to define, but it should be neither trivially simple nor intractably complex.
    3. The sweet spot is to have a mix of roughly 1:6:3 easy-medium-hard queries. Note: Here medium queries refer to the queries which hit the sweet spot of providing user value whilst not being too hard or too easy. The easy queries will ensure that your system is not regressing, the medium queries will give an idea of how your system is performing, and the hard queries will point out the improvements you can make. As you keep improving the system, you will need to update the eval set after any significant improvement, as some of your medium queries become easy ones and the hard ones become medium; don’t worry, you’ll always have a huge backlog of hard (and harder) queries (until the AGI API is released), so you won’t run out of hard queries to include in your eval sets.
    4. Size and Nature of Eval Sets: An eval set should be as big as your budget and your needs allow. A huge eval set will be expensive to run and evaluate, but it might be worth it as a checkpoint for a launch. For iterative development, smaller and varied eval sets are better: if you’re trying to improve how your system punts on out-of-scope queries, it is good to have a punt-focused eval set (jailbreaks, abusive language, etc.) and do auto evals on it (punts are deterministic responses, so checking for their presence is enough). Another trick is to use representative sets: run a huge eval first and calculate the metrics, then sample queries from it such that the metrics on the sample are representative of the larger set (the sketch after this list includes a small sampling helper). The cost reduction and the speed gains from iterating on this representative set normally offset the error introduced by sampling. This should be done for iterating, not as a launch checklist.
    5. Eval Set Evolution: I update my eval sets frequently, so the task isn’t just curation but also the evolution of the sets. I consider things at the prompt level: a prompt is the unit of measurement. I generate a unique ID for every prompt, and an eval basically becomes a measurement of a set of IDs under specific criteria.
      1. I have found this way to be flexible - a feature has some measurement criteria (and these criteria can be arbitrary) and a prompt is used to test the feature. It also helps in identifying the performance of prompts across different capabilities - a prompt P might have different performance across WEB_SEARCH_FEATURE and USER_CONTEXT_FEATURE. I also attach a lot of metadata to the prompt (is it vague, legible, etc.) and all this lets me do a lot of cross-cutting analysis (a sketch of this bookkeeping follows below). Are all these features needed? Not initially. Early on, I started doing evals for one feature with one eval set and simply kept modifying the eval set as I needed to. However, as the feature grew and more teams were interested in seeing how their areas performed with our system, I came up with the prompt-centric approach - prompts are aggressively annotated with metadata, and this has made reporting a breeze.
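A sketch of what this prompt-centric bookkeeping can look like: each prompt gets a stable ID plus metadata, an eval set becomes a filter over those records, and a sample can be drawn per difficulty bucket. The field names and helpers are illustrative, and the sampling helper only preserves the difficulty mix, a much cruder stand-in for sampling until the metrics match the full set:

```python
import random
from dataclasses import dataclass, field

@dataclass
class PromptRecord:
    """One prompt as the unit of measurement (hypothetical schema)."""
    prompt_id: str
    text: str
    difficulty: str                                   # "easy" | "medium" | "hard"
    features: set[str] = field(default_factory=set)   # e.g. {"WEB_SEARCH_FEATURE"}
    tags: set[str] = field(default_factory=set)       # e.g. {"vague", "punt"}

def eval_set(records: list[PromptRecord], feature: str) -> list[PromptRecord]:
    """An eval set is just the prompts annotated with a given feature."""
    return [r for r in records if feature in r.features]

def representative_sample(records: list[PromptRecord], k: int, seed: int = 0) -> list[PromptRecord]:
    """Sample k prompts while roughly preserving the easy/medium/hard mix."""
    rng = random.Random(seed)
    by_difficulty: dict[str, list[PromptRecord]] = {}
    for r in records:
        by_difficulty.setdefault(r.difficulty, []).append(r)
    sample: list[PromptRecord] = []
    for bucket in by_difficulty.values():
        take = max(1, round(k * len(bucket) / len(records)))
        sample.extend(rng.sample(bucket, min(take, len(bucket))))
    return sample[:k]
```

Aggregating results by difficulty, feature, or tag then falls out of the same records.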

Beyond Evals

There will be a class of queries that are unanswerable, people will try to jailbreak, etc. The (now widely accepted) risk of LLM systems is that they aren’t fault-proof, and their 9s of reliability are much lower than the other enterprise-level numbers we’re used to. This characteristic of LLM systems makes the pursuit of a perfect eval score hopeless. Do your best job of building and running evals, model them on how you anticipate your users will gain value from the feature, and keep iterating.

Further Reading

  1. A statistical approach to model evaluations by Anthropic
  2. Product Evals in Three Simple Steps by Eugene Yan
  3. LLM Evals: Everything You Need to Know