Cultivate AI
you can trust.
AI systems don't pass or fail — they perform within distributions. Bonsai teaches you to measure those distributions, calibrate the judges that score them, trace what your model actually does, and prune releases on signal you can defend in a postmortem.
Four ways in
Read, run, design, defend.
Curriculum
Lessons that take root: from your first eval set to agent harnesses that survive contact with production. Built for engineers shipping AI, not for reading about it in marketing decks.
Interactive Labs
Hands-on exercises wired to live Claude calls. Build a judge, score groundedness claim-by-claim, run prompt injections, and watch verbosity bias appear in real time.
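For a taste of the claim-level scoring the labs walk through, here is a minimal sketch of a groundedness judge built on the Anthropic Python SDK. The prompt wording and model string are placeholders, the `grounded_fraction` helper is hypothetical, and it assumes claims were already extracted upstream; the labs' actual harness may differ.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grounded_fraction(claims: list[str], source: str,
                      model: str = "claude-sonnet-4-20250514") -> float:
    """Ask the judge one yes/no question per claim; return the
    fraction of claims the source text fully supports."""
    supported = 0
    for claim in claims:
        resp = client.messages.create(
            model=model,  # placeholder model id
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    "Does the SOURCE fully support the CLAIM? "
                    "Answer with exactly YES or NO.\n\n"
                    f"SOURCE:\n{source}\n\nCLAIM:\n{claim}"
                ),
            }],
        )
        supported += resp.content[0].text.strip().upper().startswith("YES")
    return supported / len(claims)
```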
System Designs
Reference architectures for eval pipelines, RAG harnesses, agent harnesses, observability, and CI/CD that gates on statistics, not green checkmarks.
Quizzes
Scenario quizzes that test whether you'd make the right call when an eval score moves 3 points or your judge agrees with humans only 42% of the time.
The four principles
Every lesson, lab, and design reduces to one of these. Internalize them — the rest is technique, wire, and patience.
Distributions over assertions
AI outputs are probabilistic. Single pass/fail tests lie. Measure with samples, score with rubrics, gate on statistical deltas.
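What "gate on statistical deltas" looks like in code, as a minimal stdlib sketch: score a sample of generations per variant, bootstrap the delta in mean score, and block the release if the confidence interval admits a real regression. The tolerance, sample counts, and toy data are placeholder choices, not recommendations.

```python
import random

def gate_on_delta(baseline: list[float], candidate: list[float],
                  tolerance: float = 0.02, n_boot: int = 10_000,
                  alpha: float = 0.05, seed: int = 0) -> bool:
    """Bootstrap the mean-score delta (candidate - baseline). Pass the
    gate only if the CI's lower bound clears -tolerance, i.e. the data
    rules out any regression bigger than the margin we accept."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        b = [rng.choice(baseline) for _ in baseline]   # resample baseline scores
        c = [rng.choice(candidate) for _ in candidate] # resample candidate scores
        deltas.append(sum(c) / len(c) - sum(b) / len(b))
    deltas.sort()
    lower = deltas[int(alpha / 2 * n_boot)]  # lower bound of the (1-alpha) CI
    return lower >= -tolerance

# Toy rubric scores in [0, 1] for 200 sampled generations per variant.
rng = random.Random(7)
baseline = [min(1.0, max(0.0, rng.gauss(0.72, 0.10))) for _ in range(200)]
candidate = [min(1.0, max(0.0, rng.gauss(0.75, 0.10))) for _ in range(200)]
print("ship" if gate_on_delta(baseline, candidate) else "hold")
```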
Trace everything
What the model thought, which tool it picked, what it retrieved, what it returned. Outcomes alone don't tell you why a branch broke.
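A minimal sketch of the trace record this principle implies, with fields mirroring the sentence above. The step names and schema are assumptions for illustration, not a fixed format.

```python
import json, time, uuid
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class TraceStep:
    """One step of a run: enough to answer 'why did this branch break?'"""
    step: str                     # e.g. "plan" | "retrieve" | "tool_call" | "respond"
    thought: str = ""             # what the model thought (summarized)
    tool: Optional[str] = None    # which tool it picked
    retrieved_ids: list[str] = field(default_factory=list)  # what it retrieved
    output: str = ""              # what it returned
    latency_ms: float = 0.0

@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
    started_at: float = field(default_factory=time.time)
    steps: list[TraceStep] = field(default_factory=list)

    def log(self, **fields) -> None:
        self.steps.append(TraceStep(**fields))

trace = Trace()
trace.log(step="retrieve", thought="need the refund policy",
          retrieved_ids=["kb_12#4", "kb_12#5"], latency_ms=84.0)
trace.log(step="tool_call", thought="order status decides the answer",
          tool="lookup_order", output="status=shipped", latency_ms=130.0)
print(json.dumps(asdict(trace), indent=2))
```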
Loop production back to evals
Every shipped failure is a permanent regression test. Your triage queue is the only way the eval set actually gets good.
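One way to wire that loop, as a stdlib sketch: promote each triaged production failure into a deduplicated JSONL eval case. The file path, record schema, and `promote_failure` helper are all hypothetical.

```python
import hashlib, json
from pathlib import Path

EVAL_SET = Path("evals/regressions.jsonl")  # hypothetical location

def promote_failure(prompt: str, bad_output: str,
                    failure_mode: str, expected: str) -> bool:
    """Append one triaged production failure to the eval set as a
    permanent regression case. Returns False on duplicates."""
    case = {
        "id": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "input": prompt,
        "observed_failure": bad_output,
        "failure_mode": failure_mode,  # e.g. "hallucinated_policy"
        "expected": expected,          # what a passing output must do
    }
    EVAL_SET.parent.mkdir(parents=True, exist_ok=True)
    seen = set()
    if EVAL_SET.exists():
        seen = {json.loads(line)["id"]
                for line in EVAL_SET.read_text().splitlines() if line}
    if case["id"] in seen:
        return False  # this failure is already a regression test
    with EVAL_SET.open("a") as f:
        f.write(json.dumps(case) + "\n")
    return True
```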
Respect your judges
LLM-as-judge is the most cost-effective scoring tool you have — and the most common source of false confidence. Calibrate against humans, or you're grading your own homework.
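Calibration starts with a number, and raw agreement is the wrong one, because two graders can agree by luck. A minimal sketch using Cohen's kappa, which corrects for chance agreement; the toy labels are illustrative, and notice how a flattering 67% raw agreement collapses to a kappa of 0.25.

```python
from collections import Counter

def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    assert judge and len(judge) == len(human)
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    # Chance agreement: probability both graders emit the same label at random.
    expected = sum((jc[l] / n) * (hc[l] / n) for l in set(jc) | set(hc))
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human = ["pass", "fail", "fail", "pass", "pass", "pass"]
raw = sum(j == h for j, h in zip(judge, human)) / len(judge)
print(f"raw agreement: {raw:.2f}")            # 0.67 looks fine
print(f"kappa: {cohens_kappa(judge, human):.2f}")  # 0.25 says otherwise
```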
Start where you are
Every lesson stands alone. Pick the one that fits the tree you're shaping.
What is QA for AI?
Why testing non-deterministic systems demands a new playbook.
Designing evals that actually catch regressions
From vibes to a dataset that pays rent.
Human labeling and calibration
Your eval set is only as honest as the humans who labeled it.