bonsai

Cultivate AI
you can trust.

AI systems don't pass or fail — they perform within distributions. Bonsai teaches you to measure those distributions, calibrate the judges that score them, trace what your model actually does, and prune releases on signal you can defend in a postmortem.

Four ways in

Read, run, design, defend.

The four principles

Every lesson, lab, and design reduces to one of these. Internalize them — the rest is technique, wire, and patience.

01

Distributions over assertions

AI outputs are probabilistic. Single pass/fail tests lie. Measure with samples, score with rubrics, gate on statistical deltas.
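As a sketch of what "gate on statistical deltas" can mean in practice: sample many outputs from both the baseline and the candidate, score each with a rubric, and ship only if a bootstrapped lower bound on the score delta clears a regression floor. Everything here is illustrative, not bonsai's API: the scorer, the simulated model, and the `-0.02` floor are stand-ins.

```python
import random

random.seed(0)

def score(output: str) -> float:
    """Hypothetical rubric scorer: 1.0 if the output satisfies the rubric, else 0.0."""
    return 1.0 if "refund" in output else 0.0

def run_model(version: str, prompt: str) -> str:
    """Stand-in for a model call; real code would hit your inference endpoint."""
    # Simulated: the candidate version answers correctly more often.
    p = 0.9 if version == "candidate" else 0.8
    return "refund issued" if random.random() < p else "unrelated answer"

def gate(baseline_scores, candidate_scores, n_boot=2000, floor=-0.02):
    """Pass only if the bootstrapped one-sided 95% lower bound on the
    (candidate - baseline) mean-score delta stays above the regression floor."""
    deltas = []
    for _ in range(n_boot):
        b = [random.choice(baseline_scores) for _ in baseline_scores]
        c = [random.choice(candidate_scores) for _ in candidate_scores]
        deltas.append(sum(c) / len(c) - sum(b) / len(b))
    deltas.sort()
    lower = deltas[int(0.05 * n_boot)]  # 5th percentile of the bootstrap deltas
    return lower > floor

prompt = "Customer asks for a refund."
baseline = [score(run_model("baseline", prompt)) for _ in range(200)]
candidate = [score(run_model("candidate", prompt)) for _ in range(200)]
print(gate(baseline, candidate))
```

The point of the bootstrap is that a single mean comparison is itself one sample from a distribution; the gate decides on the pessimistic edge of that distribution, not the point estimate.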

02

Trace everything

What the model thought, which tool it picked, what it retrieved, what it returned. Outcomes alone don't tell you why a branch broke.

03

Loop production back to evals

Every shipped failure is a permanent regression test. Your triage queue is the only way the eval set actually gets good.
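One minimal way to make a shipped failure permanent, as a sketch: triage writes the failing trace into an append-only eval file that every future release is scored against. The file path, field names, and schema below are all hypothetical, chosen only to show the loop.

```python
import json
from pathlib import Path

def file_regression(trace_id: str, prompt: str, bad_output: str, expected: str,
                    eval_path: Path = Path("evals/regressions.jsonl")) -> None:
    """Append a triaged production failure to the eval set as a permanent case.
    All field names here are illustrative, not a fixed schema."""
    eval_path.parent.mkdir(parents=True, exist_ok=True)
    case = {
        "id": trace_id,
        "prompt": prompt,
        "bad_output": bad_output,   # what shipped and failed
        "expected": expected,       # what triage decided it should have been
        "source": "production",
    }
    with eval_path.open("a") as f:
        f.write(json.dumps(case) + "\n")

# Hypothetical triage outcome being filed as a regression case.
file_regression("trace-4812",
                "Cancel my subscription",
                "Sure, upgraded you!",
                "Confirm cancellation and stop billing")
```

Because the file only ever grows, the eval set tracks the real failure surface of the product rather than whatever the team imagined at design time.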

04

Respect your judges

LLM-as-judge is the most cost-effective scoring tool you have — and the most common source of false confidence. Calibrate against humans, or you're grading your own homework.

Start where you are

Every lesson stands alone. Pick the one that fits the tree you're shaping.

All 12 lessons →