Claude Corps · Simulation · Technical companion

How we built a cohort that doesn't exist.

The findings piece tells you what the simulation showed. This one goes under the hood: the decisions we made, the trade-offs behind them, and the reasoning, written for anyone who might want to build one of these themselves.

Internal · Program Strategy · June 2026 · A reasoning guide, not a spec

The stance

A simulation is an argument you can run

A program design is an argument: "if we staff it like this and support it like that, fellows will thrive." Most of the time that argument lives on a slide, and you find out whether it's true by shipping it at real people. A simulation is a way to make the argument executable — to run it forward under realistic conditions and watch where it bends, before anyone's year depends on it.

That framing drove every technical choice. We weren't trying to predict outcomes; we were trying to build a thing that could surprise us — that could fail in ways we hadn't pre-decided. Five principles fell out of that, and they're the spine of this whole document: ground it in reality, structure the state, make it deterministic, build in the constraints that make success hard, and keep a human at every gate.

Decision 01

Decide what you're actually modeling

The first temptation is to model everything. Resist it. A simulation that tries to capture the whole world captures nothing you can reason about. We started from the decision the simulation had to inform: are the 1:24 mentor ratio, the pod structure, and the just-in-time training design sound enough to commit to? Everything that didn't bear on that question, we cut or stubbed.

So we modeled one cohort's first month — not the full year — because the load-bearing bets show their strain early. And we treated the applicant funnel as a deliberate shortcut: a real cohort sees thousands of applicants narrowed to a few hundred finalists upstream, so we modeled only the final round (300 → 100) and called the first stage a black box. Naming that shortcut explicitly matters more than getting it "right" — a reader can see exactly what we did and didn't claim.

The principle

Start from the decision the simulation must inform, and model only what bears on it. Scope is a feature, not a compromise.

Decision 02

Ground the population in research, not imagination

Here's the trap that kills most simulations: you invent the population from your own head, and then it can only tell you what you already believed. If we'd dreamed up the host organizations, they'd have been exactly as supportive as we hoped, and the simulation would have cheerfully confirmed our design.

So the population came from the outside world first. We researched comparable programs (AI residencies, public-interest tech fellowships) and the literature on the organizations fellows would actually land in — nonprofit management research, sector turnover data, leadership-gap findings. Then we deliberately seeded the uncomfortable cases that research says exist: managers who are enthusiastic but absent, fellows who ship fast and skip review, placements a level beyond what a mentor can coach. Those aren't bugs we tolerated; they're the conditions that test a design, built in on purpose.

The principle

Your synthetic population has to be able to surprise you. Seed the failure modes reality contains — especially the ones you'd rather not find.

Decision 03

Structure the state; let the story be a second layer

Every fellow carries a compact, machine-readable state record every week: the thing you compute on. A smaller, deliberately chosen sample also keeps a first-person reflection journal: the thing you read. Two tiers, two jobs.

# one fellow, one week — the unit the whole analysis runs on
fellow_id: fellow-192
tick: T4
status_self_reported: on_track     # what they say
status_observed:     stalled      # what's true — the gap is itself a signal
support_load_hours:  1.6           # feeds the 1:24 ratio finding
manager_engagement:  absent
training_gap_state:  { stuck: [{gap: .., attribution: host_ramp}] }
latent_risks:        [stretch_fit, high_coaching_load]

The split is the point. Analysis needs structure — you can't aggregate prose. Comprehension needs voice — nobody feels a spreadsheet. And you can't afford to write a journal for all 100 fellows every week, so you sample the cases that carry the signal (the strugglers, plus a few controls). The numbers and the human story stay in sync because the journals are pinned to the state, never the other way around.
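A minimal Python sketch of the two tiers, under stated assumptions. The field names follow the record above. The `FellowWeek` class, the `report_gap` helper, the `pick_journal_sample` rule, the `engaged` value, and every fellow other than fellow-192 are hypothetical, invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FellowWeek:
    """One fellow, one week: the computable tier (fields follow the record above)."""
    fellow_id: str
    tick: str
    status_self_reported: str
    status_observed: str
    support_load_hours: float
    manager_engagement: str
    latent_risks: tuple = ()

    @property
    def report_gap(self) -> bool:
        # Hypothetical helper: what the fellow says differs from what is observed.
        return self.status_self_reported != self.status_observed


def pick_journal_sample(records, n_controls=2):
    """Hypothetical sampling rule: every fellow not observed on track, plus controls.

    Controls are the first n on-track fellows by id, so the choice is repeatable.
    """
    ordered = sorted(records, key=lambda r: r.fellow_id)
    strugglers = [r.fellow_id for r in ordered if r.status_observed != "on_track"]
    controls = [r.fellow_id for r in ordered if r.status_observed == "on_track"]
    return strugglers + controls[:n_controls]


# Made-up week of four fellows; only fellow-192 mirrors the record above.
week = [
    FellowWeek("fellow-192", "T4", "on_track", "stalled", 1.6, "absent",
               ("stretch_fit", "high_coaching_load")),
    FellowWeek("fellow-007", "T4", "on_track", "on_track", 0.5, "engaged"),
    FellowWeek("fellow-031", "T4", "on_track", "on_track", 0.4, "engaged"),
    FellowWeek("fellow-058", "T4", "on_track", "on_track", 0.6, "engaged"),
]

total_support_hours = sum(r.support_load_hours for r in week)  # structure aggregates
gaps = [r.fellow_id for r in week if r.report_gap]              # who says "fine" but isn't
sample = pick_journal_sample(week)                              # who gets a journal
```

The state tier covers all four fellows, while the journal tier covers three: the one struggler and two controls. The selection reads only from the state, which is one way to keep the narrative pinned to it.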

The principle

Decide up front what you'll compute on versus what you'll read. Keep the computable layer complete and the narrative layer sampled — and pin the narrative to the state.

Decision 04

Make the engine deterministic

This is the single most consequential technical decision, and it's counterintuitive: the simulation engine uses no randomness and no live model calls. Each tick is a pure function from the previous tick's state to the next — plain rules we wrote, run the same way every time.

# T4 maturation: did last week's intervention actually move state?
recovered = partial = False              # default: nothing moved
if lever in ("human_review", "tech_deepdive"):
    recovered = True                     # the missing piece, supplied
elif lever == "manager_reengage":
    if manager == "absent":
        partial = True                   # structural: can't coach a no-show
    elif cause == "stretch":
        pass                             # wrong lever: no effect
    else:
        recovered = True

Why give up the realism of randomness? Three reasons. Reproducibility: anyone can rerun it and get the identical result, so a finding is a property of the model, not an artifact of one lucky run. Auditability: every transition is a rule you can point at and argue with — there's no dice roll to hide behind, which forces your assumptions into the open. And recalibration in seconds: when a reviewer said "that's too harsh," we changed one rule and re-ran the whole month instantly, identically. The cost is that the engine produces state, not prose — which is exactly why the texture is authored as a separate layer (Decision 08).
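The reproducibility claim can be checked mechanically. The sketch below is one way to do that, not the engine itself: `mature` restates the maturation excerpt above as a function returning a label, while `tick`, `fingerprint`, the fallback for unknown levers, and the three sample fellows are assumptions made for illustration.

```python
import hashlib
import json


def mature(lever, manager, cause):
    """The maturation rule from the excerpt above, returned as a label."""
    if lever in ("human_review", "tech_deepdive"):
        return "recovered"
    if lever == "manager_reengage":
        if manager == "absent":
            return "partial"
        if cause == "stretch":
            return "no_effect"
        return "recovered"
    return "no_effect"  # assumption: a lever the rules don't know does nothing


def tick(state):
    """Pure function: builds a new state from the old one. No randomness, no I/O."""
    return {
        fid: {**f, "outcome": mature(f["lever"], f["manager"], f["cause"])}
        for fid, f in state.items()
    }


def fingerprint(state):
    """Stable hash of a state, so two runs can be compared exactly."""
    blob = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Made-up inputs covering each branch of the rule.
t3 = {
    "fellow-a": {"lever": "human_review", "manager": "engaged", "cause": "review_skipped"},
    "fellow-b": {"lever": "manager_reengage", "manager": "absent", "cause": "host_ramp"},
    "fellow-c": {"lever": "manager_reengage", "manager": "overcommitted", "cause": "stretch"},
}

run_1 = tick(t3)
run_2 = tick(t3)
identical = fingerprint(run_1) == fingerprint(run_2)
```

Because `tick` never mutates its input and consults nothing outside its arguments, comparing fingerprints across reruns is enough to confirm that a finding belongs to the rules rather than to a particular run.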

The principle

If you can't reproduce it exactly, you can't trust the finding. Determinism turns every assumption into something you can see, argue with, and re-run.

Decision 05

Design the dynamics to generate findings, not wishes

Halfway through the month we turned on a support layer: the people who help a fellow can observe what's happening and intervene. The danger with modeling "help" is that you accidentally build a wish-fulfillment engine where every problem gets noticed and solved. So we built in three constraints that make success hard, on purpose.

Latency, efficacy-by-cause, and cost

Help takes a week to land — nothing is fixed the instant it's noticed. Its effect depends on whether the lever matches the cause — re-engaging a manager who's merely overcommitted works; re-engaging one who is structurally absent does not, and aiming a manager-fix at a fellow who's simply over their head does nothing at all. And interventions cost the helper capacity — they're not free, so a support layer can saturate.
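The three constraints can be sketched in a few lines of Python. Everything named here is hypothetical: the `EFFECT` table, the cause labels, the two-hour cost, and the four hours of helper capacity are invented to show the mechanics, and the "partial" result for an absent manager follows the maturation excerpt in Decision 04.

```python
# Hypothetical lever/cause table: the effect depends on whether the lever fits.
EFFECT = {
    ("manager_reengage", "manager_overcommitted"): "recovered",
    ("manager_reengage", "manager_absent"): "partial",
    ("manager_reengage", "stretch"): "no_effect",
}
LATENCY_TICKS = 1   # help takes a week to land
COST_HOURS = 2.0    # assumed helper cost per intervention


def schedule(helper_capacity, requests, now):
    """Queue interventions until the helper runs out of hours.

    Returns (pending, remaining_capacity, unserved).
    """
    pending, unserved = [], []
    for fellow_id, lever, cause in requests:
        if helper_capacity >= COST_HOURS:
            helper_capacity -= COST_HOURS
            pending.append({"fellow": fellow_id, "lever": lever, "cause": cause,
                            "lands_at": now + LATENCY_TICKS})
        else:
            unserved.append(fellow_id)  # the support layer has saturated
    return pending, helper_capacity, unserved


def land(pending, now):
    """Resolve only the interventions whose delay has elapsed."""
    return {p["fellow"]: EFFECT.get((p["lever"], p["cause"]), "no_effect")
            for p in pending if p["lands_at"] <= now}


requests = [
    ("fellow-a", "manager_reengage", "manager_overcommitted"),
    ("fellow-b", "manager_reengage", "manager_absent"),
    ("fellow-c", "manager_reengage", "stretch"),
]
pending, left, unserved = schedule(helper_capacity=4.0, requests=requests, now=3)
same_week = land(pending, now=3)   # empty: nothing lands the week it is noticed
next_week = land(pending, now=4)   # effects appear one tick later
```

In this toy run the helper can afford two interventions, so the third fellow goes unserved, and the two that do land produce different outcomes from the same lever.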

The detection filter — the sharpest decision

The most important choice came from a plain observation: nobody monitors 100 people's true condition every week. Real support acts on what surfaces — a raised hand, an open escalation, a mentor noticing in a 1:1. So an intervention can only fire if the trouble reaches a helper through a channel. A fellow who quietly reports "fine," with a mentor too loaded to read them, stays invisible — and the model is blind exactly where a real program is blind. That single rule is what produced the run's sharpest finding: the quiet fellows who weren't caught until their work broke.
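A minimal sketch of such a gate, assuming the three channels named above. The function names, the `open_escalation` field, and the one-hour threshold for a mentor having room to notice are hypothetical.

```python
def surfaced_channels(fellow, mentor_free_hours):
    """Channels through which this fellow's trouble reaches a helper.

    Channel names follow the text; the one-hour threshold is an assumption.
    """
    channels = []
    if fellow["status_self_reported"] != "on_track":
        channels.append("raised_hand")
    if fellow.get("open_escalation", False):
        channels.append("escalation")
    if fellow["status_observed"] != "on_track" and mentor_free_hours >= 1.0:
        channels.append("mentor_1on1")
    return channels


def can_intervene(fellow, mentor_free_hours):
    # The gate: true condition alone never triggers help; it has to surface.
    return bool(surfaced_channels(fellow, mentor_free_hours))


quiet = {"status_self_reported": "on_track", "status_observed": "stalled"}
vocal = {"status_self_reported": "stalled", "status_observed": "stalled"}

quiet_loaded_mentor = can_intervene(quiet, mentor_free_hours=0.2)
quiet_free_mentor = can_intervene(quiet, mentor_free_hours=2.0)
vocal_loaded_mentor = can_intervene(vocal, mentor_free_hours=0.2)
```

The quiet fellow with a loaded mentor is the only case that stays invisible. The observed status is the same in all three calls, and what changes is whether any channel carries it to a helper.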

Figure: The intervention lifecycle, and the gate in front of it. Trouble only becomes actionable if it surfaces through a channel; help then lands a week later, and only if the lever fits.

Build the constraints that make success hard. Latency, lever-cause matching, capacity cost, and the detection gate are what separate a finding-generator from a fantasy.

The principle

A model that lets every problem get solved teaches you nothing. Encode the friction — delay, mismatch, cost, and blindness — and let it bite.

Decision 06

Bake in the counterfactual

We wanted to answer "would more cohort-level support staff help?" Rather than argue about it, we ran the whole month twice, once with eight support managers and once with five, holding everything else constant: a clean A/B inside the simulation.

The two runs came out identical, week after week. That looks like a null result, but it's actually the finding: crises route to a dedicated crisis track (the same single person in both versions) or to pod-level technical staff — neither of which the cohort-manager count touches. The extra staff carried routine work, not emergencies. We'd never have seen that cleanly by debating it; we saw it because the comparison was built in.
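The shape of that comparison can be sketched as a toy harness. This is not the simulation's routing logic: `run_month`, `differing_metrics`, the weekly crisis counts, and the routine-hours figure are all invented to show how varying a single parameter isolates what it touches.

```python
def run_month(support_managers, crisis_leads=1, weeks=4):
    """Toy month: crises route to the crisis track, routine work to support managers.

    All numbers here are invented for illustration.
    """
    crises_per_week = [0, 1, 2, 1]
    routine_hours_per_week = 40.0
    history = []
    for week in range(weeks):
        crises = crises_per_week[week]
        handled = min(crises, crisis_leads)  # the manager count never enters here
        history.append({
            "week": week + 1,
            "crises_handled": handled,
            "crises_missed": crises - handled,
            "routine_hours_each": routine_hours_per_week / support_managers,
        })
    return history


def differing_metrics(run_a, run_b):
    """Names of the metrics that differ anywhere between two runs."""
    return sorted({key for a, b in zip(run_a, run_b) for key in a if a[key] != b[key]})


arm_eight = run_month(support_managers=8)
arm_five = run_month(support_managers=5)
changed = differing_metrics(arm_eight, arm_five)
```

In this toy version only the routine load per manager differs between the arms, while the crisis columns match week for week. Because both arms run the same deterministic code, any difference that does appear can be attributed to the one lever that was varied.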

The principle

Turn "does X matter?" into something the simulation answers, not something the room argues. Hold everything constant, vary one lever, and let the comparison speak — even when it says "no difference."

Decision 07

Keep a human at every gate

We never let the simulation run unattended from start to finish. Each week was generated as a draft for review, with the calibration knobs exposed — the at-risk count, the severity, the base rates for each failure mode. A reviewer looked, reacted, adjusted, and only then did the next week run. Because the engine is deterministic, "adjust and re-run" took seconds and produced an identical, inspectable result.
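One way to picture the draft, review, adjust, re-run loop, as a sketch. The `stall_threshold` knob, the integer risk scores, and the `cautious_reviewer` function are hypothetical. In practice the reviewer is a person, and the function here only stands in for their decision so the example can run.

```python
def draft_week(risk_scores, knobs):
    """Deterministic draft: flag fellows whose risk score meets the threshold."""
    return sorted(f for f, score in risk_scores.items()
                  if score >= knobs["stall_threshold"])


def gated_week(risk_scores, knobs, review):
    """Re-draft until the reviewer approves.

    review(draft, knobs) returns adjusted knobs, or None to approve the draft.
    """
    rounds = 0
    while True:
        draft = draft_week(risk_scores, knobs)
        rounds += 1
        adjusted = review(draft, knobs)
        if adjusted is None:
            return draft, knobs, rounds
        knobs = adjusted


def cautious_reviewer(draft, knobs):
    # Stand-in for a person saying "two stalled fellows is too harsh": raise the bar.
    if len(draft) >= 2:
        return {**knobs, "stall_threshold": knobs["stall_threshold"] + 10}
    return None


# Made-up risk scores on a 0-100 scale.
scores = {"fellow-a": 72, "fellow-b": 65, "fellow-c": 30}
final, final_knobs, rounds = gated_week(scores, {"stall_threshold": 60}, cautious_reviewer)
```

The first draft flags two fellows, the reviewer tightens the knob, and the second draft is approved. The approved knobs are returned alongside the draft so the calibration choice stays on the record.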

This matters because the model's value isn't the numbers — it's the conversation the numbers provoke. A human has to own the calibration choices ("no, two stalled fellows is too harsh for two weeks"), and a human has to decide which surprises are signal and which are artifacts. An oracle you can't interrogate is worse than no oracle at all.

The principle

A simulation is a thinking aid, not an oracle. Gate every step on human judgment, and make re-running cheap so judgment is easy to apply.

Decision 08

Author the texture — but pin it to the truth

The deterministic engine gives you state, not feeling. To get the human layer — the reflection journals that make the run legible — we generated prose with parallel writing agents, one batch of fellows each. But the constraint was absolute: the prose is pinned to the state record and may never invent a fact. Each agent got the ground-truth status, the cause, the register, and was told exactly what could and couldn't be true for that fellow that week. Then we verified every journal against the state — right status, right arc, right voice.

That separation — deterministic facts, authored feeling, verified against each other — is what lets the journals be vivid without drifting from what the model actually said. A fellow reporting "I'm fine" while observed as stalled reads as genuine denial precisely because the state says so and the prose was written to it.
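The verification step might look like the sketch below. It assumes each journal carries a small structured `claims` block alongside its prose, which is a hypothetical convention rather than something the write-up describes. It checks only those claims against the state record. Checking the prose itself for voice and arc would still need a reader.

```python
CHECKED_FIELDS = ("fellow_id", "tick", "status_observed", "status_self_reported")


def verify_journal(journal, state):
    """Return the fields where the journal's claims disagree with the state record."""
    return [key for key in CHECKED_FIELDS
            if journal["claims"].get(key) != state[key]]


state = {
    "fellow_id": "fellow-192",
    "tick": "T4",
    "status_observed": "stalled",
    "status_self_reported": "on_track",
}

# Two made-up journals: one pinned to the state, one that has drifted from it.
pinned = {"claims": dict(state), "text": "(journal prose goes here)"}
drifted = {"claims": {**state, "status_observed": "on_track"},
           "text": "(journal prose goes here)"}

pinned_problems = verify_journal(pinned, state)
drifted_problems = verify_journal(drifted, state)
```

A journal that comes back with an empty list agrees with the state on every checked field. Anything else names the fields to regenerate or correct, with the state treated as the source of truth.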

The principle

Generated narrative is powerful and dangerous. Let it render the truth, never invent it — pin every word to the state and verify.

Decision 09

Make the dashboard a pure function of the run

The week-by-week viewer isn't hand-built; it's compiled from the run's artifacts. A build script reads every tick's state, rollups, and journals, bundles them into one self-contained file, and that's the dashboard. Re-run the script after a new tick and the viewer updates itself. No manual editing, and no chance of the visualization quietly drifting away from the data it's supposed to show.
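A build step of this kind could be as small as the sketch below. The file layout (one `T*.json` file per tick in a run directory), the `build_dashboard` name, and the HTML shell are assumptions, and a real viewer would add rendering code on top of the embedded data.

```python
import json
import tempfile
from pathlib import Path


def build_dashboard(run_dir, out_file):
    """Bundle every tick file under run_dir into one self-contained HTML page."""
    ticks = {}
    # Lexical order is fine for single-digit ticks (T1..T9).
    for path in sorted(Path(run_dir).glob("T*.json")):
        ticks[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    # Escape "</" so the embedded JSON cannot close the script tag early.
    payload = json.dumps(ticks, sort_keys=True).replace("</", "<\\/")
    html = (
        '<!doctype html>\n<meta charset="utf-8">\n<title>Run viewer</title>\n'
        '<script id="run-data" type="application/json">' + payload + "</script>\n"
    )
    Path(out_file).write_text(html, encoding="utf-8")
    return sorted(ticks)


# Usage with two made-up tick files in a temporary run directory.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp)
    (run / "T1.json").write_text(
        json.dumps({"fellow-192": {"status_observed": "on_track"}}), encoding="utf-8")
    (run / "T2.json").write_text(
        json.dumps({"fellow-192": {"status_observed": "stalled"}}), encoding="utf-8")
    included = build_dashboard(run, run / "dashboard.html")
    page = (run / "dashboard.html").read_text(encoding="utf-8")
```

Since the page is written from scratch on every build, there is nothing to hand-edit: adding a tick file and re-running the function is the only way the viewer changes.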

The principle

Never hand-edit your outputs. Compile the dashboard from the run so it's always regenerable and always matches the data.

Decision 10

Stay honest about what it is

The last decision is editorial, and it's load-bearing. Throughout, we kept a hard line between structural findings (robust to our choices — "the ratio is over capacity at the floor") and illustrative numbers (artifacts of the rates we picked — "seven fellows flagged at Day 30"). We foregrounded the synthetic nature of the data and listed the limitations plainly.

This isn't humility for its own sake. The credibility of the real findings depends on not overclaiming the fake ones. A reader who catches you implying that a simulated number is a prediction stops trusting everything — including the structural insight that's actually solid. Honesty about the boundaries is what lets the strong findings land.

The principle

Separate what the model proves from what it merely illustrates, and say which is which. The discipline that limits your claims is what makes the real ones believable.

Why it worked

The method and the finding are the same shape

The simulation concluded that the levers work and the failures are structural. It could only reach that conclusion because the method was built the same way — deterministic levers we could trust, structural constraints we let bite, and a hard line between what was proven and what was assumed. Build the instrument honestly and it earns the right to an honest finding.

If you wanted to build one

The recipe, distilled

The whole thought process, in the order you'd actually work through it:

Name the decision it must inform. Scope to that and nothing else.
Ground the population in research. Seed the real failure modes, including the uncomfortable ones, so it can surprise you.
Structure the state; sample the story. Complete machine-readable state for everyone; narrative for the cases that carry signal.
Make the engine deterministic. Pure functions, no hidden randomness — so findings are reproducible and assumptions are visible.
Encode the friction. Delay, lever-cause mismatch, capacity cost, and a detection gate, so success has to be earned.
Bake in a counterfactual. Vary one structural lever, hold the rest, and let the comparison answer the question.
Gate on human judgment. Draft each step, expose the knobs, make re-running cheap.
Author texture, pinned to truth. Render the state as story; never let the story invent facts.
Compile the dashboard from the run. Regenerable, never hand-edited.
Draw the line between proven and illustrative. Foreground the limits; it's what makes the findings credible.

The shortest version

Make your design's argument executable, ground it in reality, run it deterministically with the friction turned on, keep a human in the loop, and be ruthless about separating what you proved from what you assumed. The rest is engineering.
