
Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

In this insightful conversation, Hamel Husain and Shreya Shankar break down the essential practice of AI evaluations, emphasizing their role as a foundational discipline for building effective AI-powered products. They move beyond buzzwords to reveal how structured, human-led evals enable teams to systematically understand model behavior, uncover hidden failure modes, and drive meaningful product improvements.
The discussion outlines a step-by-step framework for creating impactful AI evals, starting with manual error analysis: reading real-world traces and open coding them to identify recurring issues. The hosts stress that LLMs cannot replace human judgment in early-stage analysis because they lack the necessary context. They introduce axial coding to categorize errors and recommend reaching theoretical saturation before scaling. The episode contrasts code-based evals, which suit deterministic checks, with LLM-as-judge approaches for nuanced assessments, and emphasizes validating those judges against human consensus. Evals are positioned not as one-time tests but as evolving specifications that guide product development, surpassing traditional PRDs in dynamic AI systems. Common pitfalls include over-reliance on automation, skipping deep error analysis, and mistaking 'vibes' for validation. By focusing on high-impact failure modes, teams can implement robust evals efficiently, with minimal ongoing effort once the process is set up, ensuring AI products deliver real user value.
05:00
Evals help move beyond vibe checks by providing measurable feedback for AI applications.
16:01
Looking at the raw data is crucial when analyzing an LLM application.
22:27
The AI told a user about non-existent virtual tours, revealing a hallucination error.
23:54
Manual open coding is crucial because LLMs lack context for accurate error assessment.
25:23
One trusted person with domain expertise should lead AI evaluations to avoid overcomplication.
31:05
Theoretical saturation is reached when no new problem types are discovered during analysis.
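The saturation idea can be sketched in code: keep open-coding traces until a stretch of consecutive traces surfaces no new failure-mode label. This is a minimal toy illustration; the `patience` threshold and label names are assumptions, not from the episode.

```python
# Toy illustration of theoretical saturation: stop open-coding once a
# run of consecutive traces yields no new failure-mode labels.
def reached_saturation(labels_per_trace, patience=20):
    """labels_per_trace: list of per-trace label lists, in review order."""
    seen = set()
    since_new = 0  # traces reviewed since the last new label appeared
    for labels in labels_per_trace:
        new = set(labels) - seen
        if new:
            seen |= new
            since_new = 0
        else:
            since_new += 1
        if since_new >= patience:
            return True
    return False

# Two new labels, then 20 traces with nothing new: saturated.
print(reached_saturation([["hallucination"], ["formatting"]] + [[]] * 20))
```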
31:39
Axial codes act as failure-mode labels to cluster and identify the most common AI errors.
44:39
17 conversational flow issues were identified using a pivot table analysis.
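A minimal stand-in for that pivot-table step, using `collections.Counter` over toy open-coded traces (the trace IDs and labels here are invented for illustration):

```python
# Tally axial codes to surface the most common failure modes;
# Counter stands in for a spreadsheet pivot table over labeled traces.
from collections import Counter

open_codes = [
    ("trace_01", "conversational_flow"),
    ("trace_02", "hallucination"),
    ("trace_03", "conversational_flow"),
    ("trace_04", "formatting"),
    ("trace_05", "conversational_flow"),
]

counts = Counter(code for _, code in open_codes)
for code, n in counts.most_common():
    print(code, n)  # most frequent failure mode first
```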
46:06
Dumb engineering errors in AI, like formatting mistakes, don't require full evals; they're obvious to fix.
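Such formatting mistakes are exactly where a purely code-based check fits: a deterministic assertion instead of an LLM judge. A sketch, assuming a hypothetical output schema with `subject` and `body` keys:

```python
# A "dumb" deterministic check: pass only if the model returned valid
# JSON with the required keys (schema is a hypothetical example).
import json

def check_output(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and {"subject", "body"} <= obj.keys()

print(check_output('{"subject": "Hi", "body": "Hello"}'))  # True
print(check_output("Sure! Here is the JSON:"))             # False
```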
51:05
LLM judges can reliably output pass or fail results for complex AI behaviors.
52:10
Use binary yes/no judgments instead of rating scales for reliable LLM evals.
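A sketch of what a binary judge can look like. The prompt wording, the real-estate framing, and `call_llm` are illustrative assumptions; swap in your own model client:

```python
# Binary LLM-as-judge sketch: force a single pass/fail token so the
# verdict can be compared directly against human labels.
JUDGE_PROMPT = """You are evaluating a real-estate assistant's reply.
Fail the reply if it invents amenities (e.g. virtual tours) that are
not in the listing data. Answer with exactly one word: pass or fail.

Listing data: {listing}
Assistant reply: {reply}"""

def judge(listing: str, reply: str, call_llm) -> bool:
    """call_llm: any function mapping a prompt string to a completion."""
    answer = call_llm(JUDGE_PROMPT.format(listing=listing, reply=reply))
    return answer.strip().lower() == "pass"
```

Dependency-injecting `call_llm` keeps the judge testable with a stub before spending tokens on a real model.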
57:19
High agreement percentages between LLM and human judges can be misleading when errors are rare.
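A toy illustration of why raw agreement misleads: with a 5% failure rate, a judge that always says "pass" agrees with humans 95% of the time yet has zero chance-corrected agreement (Cohen's kappa). The data here is fabricated for the arithmetic only.

```python
# A judge that never flags failures still scores 95% raw agreement
# when only 5% of traces fail -- kappa exposes the trick.
human = ["fail"] * 5 + ["pass"] * 95
judge = ["pass"] * 100

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)

def kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

print(agreement)            # 0.95
print(kappa(human, judge))  # 0.0
```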
1:03:19
Experts can't anticipate all failure modes in LLM output validation.
1:05:09
Fixing problems doesn't always require writing an eval.
1:07:41
Product managers can build profitable products using the skill set of implementing LLM judges for systematic improvement.
1:09:57
Strong opinions against AI evals often ignore their widespread practical use in development.
1:17:48
A/B tests should be powered by actual error analysis, not hypotheticals.
1:18:26
Evals are essentially data science for understanding AI product performance.
1:22:30
More people should adopt structured approaches to application-specific evals.
1:23:02
There's high demand for Hamel and Shreya's Maven course.
1:29:59
The goal of evals is to improve the product, not just catch bugs.
1:33:19
An AI sending factually correct emails isn't good enough; product thinking is essential for real effectiveness.
1:36:30
Students get 10 months of free, unlimited access to all course-related AI content and resources.
1:40:57
Hamel's life motto is 'Keep learning and think like a beginner.'