Cultivate AI
you can trust.
AI systems don't pass or fail — they perform within distributions. Bonsai teaches you to measure those distributions, calibrate the judges that score them, trace what your model actually does, and prune releases on signal you can defend in a postmortem.
Four ways in
Read, run, design, defend.
Curriculum
Lessons that take root: from your first eval set to agent harnesses that survive contact with production. Built for engineers shipping AI, not for reading about it in marketing decks.
Interactive Labs
Hands-on exercises wired to live Claude calls. Build a judge, score groundedness claim-by-claim, run prompt injections, and watch verbosity bias appear in real time.
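For a taste of the claim-level scoring the labs walk through, here is a minimal sketch of a groundedness judge built on the Anthropic Python SDK. The prompt wording and model string are placeholders, the `grounded_fraction` helper is hypothetical, and it assumes claims were already extracted upstream; the labs' actual harness may differ.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grounded_fraction(claims: list[str], source: str,
                      model: str = "claude-sonnet-4-20250514") -> float:
    """Ask the judge one yes/no question per claim; return the
    fraction of claims the source text fully supports."""
    supported = 0
    for claim in claims:
        resp = client.messages.create(
            model=model,  # placeholder model id
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    "Does the SOURCE fully support the CLAIM? "
                    "Answer with exactly YES or NO.\n\n"
                    f"SOURCE:\n{source}\n\nCLAIM:\n{claim}"
                ),
            }],
        )
        supported += resp.content[0].text.strip().upper().startswith("YES")
    return supported / len(claims)
```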
System Designs
Reference architectures for eval pipelines, RAG harnesses, agent harnesses, observability, and CI/CD that gates on statistics, not green checkmarks.
Quizzes
Scenario quizzes that test whether you'd make the right call when an eval score moves 3 points or your judge agrees with humans only 42% of the time.
The four principles
Every lesson, lab, and design reduces to one of these. Internalize them — the rest is technique, wire, and patience.
Distributions over assertions
AI outputs are probabilistic. Single pass/fail tests lie. Measure with samples, score with rubrics, gate on statistical deltas.
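What "gate on statistical deltas" looks like in code, as a minimal stdlib sketch: score a sample of generations per variant, bootstrap the delta in mean score, and block the release if the confidence interval admits a real regression. The tolerance, sample counts, and toy data are placeholder choices, not recommendations.

```python
import random

def gate_on_delta(baseline: list[float], candidate: list[float],
                  tolerance: float = 0.02, n_boot: int = 10_000,
                  alpha: float = 0.05, seed: int = 0) -> bool:
    """Bootstrap the mean-score delta (candidate - baseline). Pass the
    gate only if the CI's lower bound clears -tolerance, i.e. the data
    rules out any regression bigger than the margin we accept."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        b = [rng.choice(baseline) for _ in baseline]   # resample baseline scores
        c = [rng.choice(candidate) for _ in candidate] # resample candidate scores
        deltas.append(sum(c) / len(c) - sum(b) / len(b))
    deltas.sort()
    lower = deltas[int(alpha / 2 * n_boot)]  # lower bound of the (1-alpha) CI
    return lower >= -tolerance

# Toy rubric scores in [0, 1] for 200 sampled generations per variant.
rng = random.Random(7)
baseline = [min(1.0, max(0.0, rng.gauss(0.72, 0.10))) for _ in range(200)]
candidate = [min(1.0, max(0.0, rng.gauss(0.75, 0.10))) for _ in range(200)]
print("ship" if gate_on_delta(baseline, candidate) else "hold")
```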
Trace everything
What the model thought, which tool it picked, what it retrieved, what it returned. Outcomes alone don't tell you why a branch broke.
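A minimal sketch of the trace record this principle implies, with fields mirroring the sentence above. The step names and schema are assumptions for illustration, not a fixed format.

```python
import json, time, uuid
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class TraceStep:
    """One step of a run: enough to answer 'why did this branch break?'"""
    step: str                     # e.g. "plan" | "retrieve" | "tool_call" | "respond"
    thought: str = ""             # what the model thought (summarized)
    tool: Optional[str] = None    # which tool it picked
    retrieved_ids: list[str] = field(default_factory=list)  # what it retrieved
    output: str = ""              # what it returned
    latency_ms: float = 0.0

@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
    started_at: float = field(default_factory=time.time)
    steps: list[TraceStep] = field(default_factory=list)

    def log(self, **fields) -> None:
        self.steps.append(TraceStep(**fields))

trace = Trace()
trace.log(step="retrieve", thought="need the refund policy",
          retrieved_ids=["kb_12#4", "kb_12#5"], latency_ms=84.0)
trace.log(step="tool_call", thought="order status decides the answer",
          tool="lookup_order", output="status=shipped", latency_ms=130.0)
print(json.dumps(asdict(trace), indent=2))
```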
Loop production back to evals
Every shipped failure is a permanent regression test. Your triage queue is the only way the eval set actually gets good.
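One way to wire that loop, as a stdlib sketch: promote each triaged production failure into a deduplicated JSONL eval case. The file path, record schema, and `promote_failure` helper are all hypothetical.

```python
import hashlib, json
from pathlib import Path

EVAL_SET = Path("evals/regressions.jsonl")  # hypothetical location

def promote_failure(prompt: str, bad_output: str,
                    failure_mode: str, expected: str) -> bool:
    """Append one triaged production failure to the eval set as a
    permanent regression case. Returns False on duplicates."""
    case = {
        "id": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "input": prompt,
        "observed_failure": bad_output,
        "failure_mode": failure_mode,  # e.g. "hallucinated_policy"
        "expected": expected,          # what a passing output must do
    }
    EVAL_SET.parent.mkdir(parents=True, exist_ok=True)
    seen = set()
    if EVAL_SET.exists():
        seen = {json.loads(line)["id"]
                for line in EVAL_SET.read_text().splitlines() if line}
    if case["id"] in seen:
        return False  # this failure is already a regression test
    with EVAL_SET.open("a") as f:
        f.write(json.dumps(case) + "\n")
    return True
```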
Respect your judges
LLM-as-judge is the most cost-effective scoring tool you have — and the most common source of false confidence. Calibrate against humans, or you're grading your own homework.
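Calibration starts with a number, and raw agreement is the wrong one, because two graders can agree by luck. A minimal sketch using Cohen's kappa, which corrects for chance agreement; the toy labels are illustrative, and notice how a flattering 67% raw agreement collapses to a kappa of 0.25.

```python
from collections import Counter

def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    assert judge and len(judge) == len(human)
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    # Chance agreement: probability both graders emit the same label at random.
    expected = sum((jc[l] / n) * (hc[l] / n) for l in set(jc) | set(hc))
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human = ["pass", "fail", "fail", "pass", "pass", "pass"]
raw = sum(j == h for j, h in zip(judge, human)) / len(judge)
print(f"raw agreement: {raw:.2f}")            # 0.67 looks fine
print(f"kappa: {cohens_kappa(judge, human):.2f}")  # 0.25 says otherwise
```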
Start where you are
Every lesson stands alone. Pick the one that fits the tree you're shaping.
What is QA for AI?
Why testing non-deterministic systems demands a new playbook.
Designing evals that actually catch regressions
From vibes to a dataset that pays rent.
Human labeling and calibration
Your eval set is only as honest as the humans who labeled it.