The findings piece tells you what the simulation showed. This one goes under the hood: the decisions we made, the trade-offs behind them, and the reasoning, written for anyone who might want to build one of these themselves.
You don't need to read the code to follow this. The goal is the thought process — the questions we asked at each fork, and why we went the way we did. If you finish it thinking "I could build one of these for a problem I care about," it worked. The distilled recipe is at the end.
A program design is an argument: "if we staff it like this and support it like that, fellows will thrive." Most of the time that argument lives on a slide, and you find out whether it's true by shipping it at real people. A simulation is a way to make the argument executable — to run it forward under realistic conditions and watch where it bends, before anyone's year depends on it.
That framing drove every technical choice. We weren't trying to predict outcomes; we were trying to build a thing that could surprise us — that could fail in ways we hadn't pre-decided. Five principles fell out of that, and they're the spine of this whole document: ground it in reality, structure the state, make it deterministic, build in the constraints that make success hard, and keep a human at every gate.
Before the decisions, the map. Six stages, each feeding the next. The arrows only go one way until the very end, where the analysis loops back to inform the next run.
The first temptation is to model everything. Resist it. A simulation that tries to capture the whole world captures nothing you can reason about. We started from the decision the simulation had to inform: are the 1:24 mentor ratio, the pod structure, and the just-in-time training design sound enough to commit to? Everything that didn't bear on that question, we cut or stubbed.
So we modeled one cohort's first month — not the full year — because the load-bearing bets show their strain early. And we treated the applicant funnel as a deliberate shortcut: a real cohort sees thousands of applicants narrowed to a few hundred finalists upstream, so we modeled only the final round (300 → 100) and called the first stage a black box. Naming that shortcut explicitly matters more than getting it "right" — a reader can see exactly what we did and didn't claim.
Start from the decision the simulation must inform, and model only what bears on it. Scope is a feature, not a compromise.
Here's the trap that kills most simulations: you invent the population from your own head, and then it can only tell you what you already believed. If we'd dreamed up the host organizations, they'd have been exactly as supportive as we hoped, and the simulation would have cheerfully confirmed our design.
So the population came from the outside world first. We researched comparable programs (AI residencies, public-interest tech fellowships) and the literature on the organizations fellows would actually land in — nonprofit management research, sector turnover data, leadership-gap findings. Then we deliberately seeded the uncomfortable cases that research says exist: managers who are enthusiastic but absent, fellows who ship fast and skip review, placements a level beyond what a mentor can coach. Those aren't bugs we tolerated; they're the conditions that test a design, built in on purpose.
Your synthetic population has to be able to surprise you. Seed the failure modes reality contains — especially the ones you'd rather not find.
Every fellow carries a compact, machine-readable state record every week: the thing you compute on. A smaller, deliberately chosen sample also keeps a first-person reflection journal: the thing you read. Two tiers, two jobs.
    # one fellow, one week: the unit the whole analysis runs on
    fellow_id: fellow-192
    tick: T4
    status_self_reported: on_track   # what they say
    status_observed: stalled         # what's true; the gap is itself a signal
    support_load_hours: 1.6          # feeds the 1:24 ratio finding
    manager_engagement: absent
    training_gap_state: { stuck: [{gap: .., attribution: host_ramp}] }
    latent_risks: [stretch_fit, high_coaching_load]
The split is the point. Analysis needs structure — you can't aggregate prose. Comprehension needs voice — nobody feels a spreadsheet. And you can't afford to write a journal for all 100 fellows every week, so you sample the cases that carry the signal (the strugglers, plus a few controls). The numbers and the human story stay in sync because the journals are pinned to the state, never the other way around.
Decide up front what you'll compute on versus what you'll read. Keep the computable layer complete and the narrative layer sampled — and pin the narrative to the state.
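One way to picture the two tiers is a typed record plus a sampling rule. This is a minimal sketch, not the project's code: the field names follow the state record shown earlier, while `FellowState`, `select_journal_sample`, the `n_controls` parameter, the `engaged` value, and every fellow other than fellow-192 are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FellowState:
    """Computable tier: one fellow, one week."""
    fellow_id: str
    tick: str
    status_self_reported: str
    status_observed: str
    support_load_hours: float
    manager_engagement: str
    latent_risks: tuple = ()

    @property
    def reporting_gap(self) -> bool:
        # Saying one thing while another is observed is itself a signal.
        return self.status_self_reported != self.status_observed


def select_journal_sample(states, n_controls=2):
    """Narrative tier: strugglers plus a few controls (assumed rule)."""
    ordered = sorted(states, key=lambda s: s.fellow_id)  # stable, no randomness
    strugglers = [s for s in ordered if s.status_observed != "on_track"]
    controls = [s for s in ordered if s.status_observed == "on_track"][:n_controls]
    return strugglers + controls


# Illustrative cohort; only fellow-192 mirrors the record in the text.
cohort = [
    FellowState("fellow-192", "T4", "on_track", "stalled", 1.6, "absent",
                ("stretch_fit", "high_coaching_load")),
    FellowState("fellow-007", "T4", "on_track", "on_track", 0.5, "engaged"),
    FellowState("fellow-051", "T4", "on_track", "on_track", 0.4, "engaged"),
    FellowState("fellow-110", "T4", "on_track", "on_track", 0.3, "engaged"),
]
sample = select_journal_sample(cohort)
print([s.fellow_id for s in sample])
```

Every fellow gets a `FellowState`; only the sampled ones get a journal, and that journal would be written from the record rather than the reverse.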
This is the single most consequential technical decision, and it's counterintuitive: the simulation engine uses no randomness and no live model calls. Each tick is a pure function from the previous tick's state to the next — plain rules we wrote, run the same way every time.
    # T4 maturation: did last week's intervention actually move state?
    if lever in ("human_review", "tech_deepdive"):
        recovered = True          # the missing piece, supplied
    elif lever == "manager_reengage":
        if manager == "absent":
            partial = True        # structural: can't coach a no-show
        elif cause == "stretch":
            pass                  # wrong lever: no effect
        else:
            recovered = True
Why give up the realism of randomness? Three reasons. Reproducibility: anyone can rerun it and get the identical result, so a finding is a property of the model, not an artifact of one lucky run. Auditability: every transition is a rule you can point at and argue with — there's no dice roll to hide behind, which forces your assumptions into the open. And recalibration in seconds: when a reviewer said "that's too harsh," we changed one rule and re-ran the whole month instantly, identically. The cost is that the engine produces state, not prose — which is exactly why the texture is authored as a separate layer (Decision 08).
If you can't reproduce it exactly, you can't trust the finding. Determinism turns every assumption into something you can see, argue with, and re-run.
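The reproducibility claim can be checked mechanically. The sketch below assumes a dict-based state with an integer `tick` and one made-up transition rule; `step`, `run`, and `fingerprint` are hypothetical names, not the engine's. It shows the shape of the idea: a pure function per tick, and a hash that comes out identical on every rerun.

```python
import hashlib
import json


def step(state: dict) -> dict:
    """Pure transition: returns a new state and never mutates its input."""
    nxt = dict(state)
    nxt["tick"] = state["tick"] + 1
    # Assumed rule for illustration: an absent manager stalls a fellow by week 2.
    if state["manager_engagement"] == "absent" and nxt["tick"] >= 2:
        nxt["status_observed"] = "stalled"
    return nxt


def run(initial: dict, n_ticks: int) -> list:
    """Apply step() n_ticks times and keep the full history."""
    history = [initial]
    for _ in range(n_ticks):
        history.append(step(history[-1]))
    return history


def fingerprint(history: list) -> str:
    """SHA-256 of canonical JSON; identical runs give identical digests."""
    blob = json.dumps(history, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


start = {"fellow_id": "fellow-192", "tick": 0,
         "manager_engagement": "absent", "status_observed": "on_track"}
first = run(start, 4)
second = run(start, 4)
print(fingerprint(first) == fingerprint(second))
```

Storing the digest alongside a finding gives a reviewer a cheap way to confirm they are looking at the same run.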
Halfway through the month we turned on a support layer: the people who help a fellow can observe what's happening and intervene. The danger with modeling "help" is that you accidentally build a wish-fulfillment engine where every problem gets noticed and solved. So we built in three constraints that make success hard, on purpose.
Help takes a week to land — nothing is fixed the instant it's noticed. Its effect depends on whether the lever matches the cause — re-engaging a manager who's merely overcommitted works; re-engaging one who is structurally absent does not, and aiming a manager-fix at a fellow who's simply over their head does nothing at all. And interventions cost the helper capacity — they're not free, so a support layer can saturate.
The most important choice came from a plain observation: nobody monitors 100 people's true condition every week. Real support acts on what surfaces — a raised hand, an open escalation, a mentor noticing in a 1:1. So an intervention can only fire if the trouble reaches a helper through a channel. A fellow who quietly reports "fine," with a mentor too loaded to read them, stays invisible — and the model is blind exactly where a real program is blind. That single rule is what produced the run's sharpest finding: the quiet fellows who weren't caught until their work broke.
[Figure: The intervention lifecycle, and the gate in front of it. Trouble only becomes actionable if it surfaces through a channel; help then lands a week later, and only if the lever fits.]
A model that lets every problem get solved teaches you nothing. Encode the friction — delay, mismatch, cost, and blindness — and let it bite.
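The four kinds of friction can each be a few lines of rule. This sketch is an assumption-heavy illustration: `Helper`, `surfaces`, `try_intervene`, `outcome`, the hour costs, and the `raised_hand` / `open_escalation` flags are hypothetical names. Only the lever logic in `outcome` mirrors the T4 maturation rule shown earlier.

```python
from dataclasses import dataclass


@dataclass
class Helper:
    capacity_hours: float  # hours left this week (assumed unit)


def surfaces(fellow: dict, helper: Helper, notice_cost: float = 1.0) -> bool:
    """Blindness: trouble is actionable only if it reaches a helper."""
    if fellow["raised_hand"] or fellow["open_escalation"]:
        return True
    # A mentor can notice in a 1:1 only with enough slack to read the fellow.
    struggling = fellow["status_observed"] != "on_track"
    return struggling and helper.capacity_hours >= notice_cost


def try_intervene(fellow, helper, lever, tick, cost_hours=2.0):
    """Return a pending intervention, or None if nothing can fire."""
    if not surfaces(fellow, helper):
        return None                         # blindness
    if helper.capacity_hours < cost_hours:
        return None                         # cost: the helper is saturated
    helper.capacity_hours -= cost_hours
    return {"fellow_id": fellow["fellow_id"], "lever": lever,
            "lands_at": tick + 1}           # delay: help lands a week later


def outcome(lever: str, cause: str, manager: str) -> str:
    """Mismatch: the effect depends on whether the lever fits the cause."""
    if lever in ("human_review", "tech_deepdive"):
        return "recovered"
    if lever == "manager_reengage":
        if manager == "absent":
            return "partial"                # structural: can't coach a no-show
        if cause == "stretch":
            return "none"                   # wrong lever
        return "recovered"
    return "none"


quiet = {"fellow_id": "fellow-192", "status_observed": "stalled",
         "raised_hand": False, "open_escalation": False}
print(try_intervene(quiet, Helper(capacity_hours=0.5), "human_review", tick=3))
print(try_intervene(quiet, Helper(capacity_hours=4.0), "human_review", tick=3))
```

The first call returns `None` because the quiet fellow never surfaces to an overloaded helper; the second fires but only lands at tick 4.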
We wanted to answer "would more cohort-level support staff help?" — so we didn't argue about it, we ran the whole month twice, once with eight support managers and once with five, holding everything else constant. A clean A/B inside the simulation.
The two runs came out identical, week after week. That looks like a null result, but it's actually the finding: crises route to a dedicated crisis track (the same single person in both versions) or to pod-level technical staff — neither of which the cohort-manager count touches. The extra staff carried routine work, not emergencies. We'd never have seen that cleanly by debating it; we saw it because the comparison was built in.
Turn "does X matter?" into something the simulation answers, not something the room argues. Hold everything constant, vary one lever, and let the comparison speak — even when it says "no difference."
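A built-in comparison can be as small as a harness that copies the configuration, changes one key, and reports the weeks where the outputs differ. The sketch below is hypothetical throughout: `compare_runs`, `toy_run`, the config keys, and the crisis counts are invented stand-ins for the real engine and its inputs.

```python
import copy


def compare_runs(run_fn, base_config: dict, lever: str, value_a, value_b) -> list:
    """Run twice, changing only `lever`; return the weeks where outputs differ."""
    cfg_a = copy.deepcopy(base_config)
    cfg_b = copy.deepcopy(base_config)
    cfg_a[lever] = value_a
    cfg_b[lever] = value_b
    out_a, out_b = run_fn(cfg_a), run_fn(cfg_b)
    return [week for week, (a, b) in enumerate(zip(out_a, out_b)) if a != b]


def toy_run(cfg: dict) -> list:
    """Stand-in engine: unresolved crises per week (assumed routing)."""
    # Crises go to the crisis track; cohort managers never touch them.
    return [max(0, n - cfg["crisis_track_staff"]) for n in cfg["crises_per_week"]]


base = {"cohort_managers": 8, "crisis_track_staff": 1,
        "crises_per_week": [0, 2, 3, 1]}  # made-up toy numbers
print(compare_runs(toy_run, base, "cohort_managers", 8, 5))
print(compare_runs(toy_run, base, "crisis_track_staff", 1, 2))
```

An empty list for the first comparison is the "no difference" answer; the second comparison shows the harness does report a difference when the lever sits on the path the crises actually take.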
We never let the simulation run unattended from start to finish. Each week was generated as a draft for review, with the calibration knobs exposed — the at-risk count, the severity, the base rates for each failure mode. A reviewer looked, reacted, adjusted, and only then did the next week run. Because the engine is deterministic, "adjust and re-run" took seconds and produced an identical, inspectable result.
This matters because the model's value isn't the numbers — it's the conversation the numbers provoke. A human has to own the calibration choices ("no, two stalled fellows is too harsh for two weeks"), and a human has to decide which surprises are signal and which are artifacts. An oracle you can't interrogate is worse than no oracle at all.
A simulation is a thinking aid, not an oracle. Gate every step on human judgment, and make re-running cheap so judgment is easy to apply.
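The review gate can be sketched as a loop that will not advance until a reviewer accepts the draft week. Everything here is an assumption made for illustration: `run_gated`, the `stall_after_weeks` knob, the toy `step`, and the scripted reviewer that stands in for a human.

```python
def run_gated(initial: dict, n_ticks: int, step, review, knobs: dict):
    """Advance one week at a time; `review` may adjust knobs before accepting."""
    history = [initial]
    for _ in range(n_ticks):
        while True:
            draft = step(history[-1], knobs)
            adjusted = review(draft, knobs)   # None means "approved"
            if adjusted is None:
                break
            knobs = adjusted                  # deterministic, so re-running is cheap
        history.append(draft)
    return history, knobs


def step(state: dict, knobs: dict) -> dict:
    """Toy deterministic rule driven by one exposed calibration knob."""
    week = state["week"] + 1
    return {"week": week, "stalled": week >= knobs["stall_after_weeks"]}


def scripted_review(draft: dict, knobs: dict):
    """Stand-in for a human saying 'stalled by week 2 is too harsh'."""
    if draft["week"] == 2 and draft["stalled"]:
        return {**knobs, "stall_after_weeks": 3}
    return None


history, final_knobs = run_gated({"week": 0, "stalled": False}, 3,
                                 step, scripted_review,
                                 {"stall_after_weeks": 2})
print(history[-1], final_knobs)
```

In a real setup `review` would block on a person; the point of the sketch is only that the next week never runs on an unreviewed draft.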
The deterministic engine gives you state, not feeling. To get the human layer — the reflection journals that make the run legible — we generated prose with parallel writing agents, one batch of fellows each. But the constraint was absolute: the prose is pinned to the state record and may never invent a fact. Each agent got the ground-truth status, the cause, the register, and was told exactly what could and couldn't be true for that fellow that week. Then we verified every journal against the state — right status, right arc, right voice.
That separation — deterministic facts, authored feeling, verified against each other — is what lets the journals be vivid without drifting from what the model actually said. A fellow reporting "I'm fine" while observed as stalled reads as genuine denial precisely because the state says so and the prose was written to it.
Generated narrative is powerful and dangerous. Let it render the truth, never invent it — pin every word to the state and verify.
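Part of the verification step can be automated if each journal carries a small header of the facts it claims to render. That header is an assumption of this sketch, as are `verify_journal` and the `voiced_status` / `underlying_status` field names. The prose itself would still need a reader.

```python
def verify_journal(journal: dict, state: dict) -> list:
    """Return mismatches between a journal's claimed facts and the state record."""
    problems = []
    for key in ("fellow_id", "tick"):
        if journal.get(key) != state.get(key):
            problems.append(
                f"{key}: journal={journal.get(key)!r} state={state.get(key)!r}")
    # Voice follows what the fellow says; arc follows what is observed.
    if journal.get("voiced_status") != state.get("status_self_reported"):
        problems.append("voiced_status does not match status_self_reported")
    if journal.get("underlying_status") != state.get("status_observed"):
        problems.append("underlying_status does not match status_observed")
    return problems


state = {"fellow_id": "fellow-192", "tick": "T4",
         "status_self_reported": "on_track", "status_observed": "stalled"}
good = {"fellow_id": "fellow-192", "tick": "T4",
        "voiced_status": "on_track", "underlying_status": "stalled"}
drifted = dict(good, underlying_status="on_track")  # prose invented a recovery

print(verify_journal(good, state))
print(verify_journal(drifted, state))
```

A journal that fails the check gets regenerated from the state; the state is never edited to match the prose.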
The week-by-week viewer isn't hand-built; it's compiled from the run's artifacts. A build script reads every tick's state, rollups, and journals, bundles them into one self-contained file, and that's the dashboard. Re-run the script after a new tick and the viewer updates itself. No manual editing, and no chance of the visualization quietly drifting away from the data it's supposed to show.
Never hand-edit your outputs. Compile the dashboard from the run so it's always regenerable and always matches the data.
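A compile step of this kind can be very small. The sketch below assumes one JSON file per tick named `tick_*.json` and emits a bare HTML page with the data inlined; the file layout, `build_dashboard`, and the placeholder viewer script are hypothetical, not the project's build script.

```python
import json
import tempfile
from pathlib import Path


def build_dashboard(run_dir: Path, out_file: Path) -> int:
    """Bundle every tick file under run_dir into one self-contained HTML file."""
    ticks = []
    for path in sorted(run_dir.glob("tick_*.json")):  # sorted for a stable order
        ticks.append(json.loads(path.read_text(encoding="utf-8")))
    # Escape "</" so embedded data cannot close the script tag early.
    payload = json.dumps(ticks, sort_keys=True).replace("</", "<\\/")
    html = (
        "<!doctype html>\n<meta charset=\"utf-8\">\n<title>Run viewer</title>\n"
        "<div id=\"app\"></div>\n"
        f"<script>const RUN = {payload};\n"
        "document.getElementById('app').textContent = RUN.length + ' ticks loaded';"
        "</script>\n"
    )
    out_file.write_text(html, encoding="utf-8")
    return len(ticks)


# Demo against throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for i in range(3):
        (run_dir / f"tick_{i}.json").write_text(json.dumps({"tick": i}),
                                                encoding="utf-8")
    out = run_dir / "dashboard.html"
    count = build_dashboard(run_dir, out)
    html_text = out.read_text(encoding="utf-8")
print(count)
```

Because the page is rebuilt from the tick files every time, there is nothing to hand-edit. Zero-padding the tick numbers in file names would keep the sorted order correct past nine ticks.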
The last decision is editorial, and it's load-bearing. Throughout, we kept a hard line between structural findings (robust to our choices — "the ratio is over capacity at the floor") and illustrative numbers (artifacts of the rates we picked — "seven fellows flagged at Day 30"). We foregrounded the synthetic nature of the data and listed the limitations plainly.
This isn't humility for its own sake. The credibility of the real findings depends on not overclaiming the fake ones. A reader who catches you implying that a simulated number is a prediction stops trusting everything — including the structural insight that's actually solid. Honesty about the boundaries is what lets the strong findings land.
Separate what the model proves from what it merely illustrates, and say which is which. The discipline that limits your claims is what makes the real ones believable.
The simulation concluded that the levers work and the failures are structural. It could only reach that conclusion because the method was built the same way — deterministic levers we could trust, structural constraints we let bite, and a hard line between what was proven and what was assumed. Build the instrument honestly and it earns the right to an honest finding.
The whole thought process, in the order you'd actually work through it:
Make your design's argument executable, ground it in reality, run it deterministically with the friction turned on, keep a human in the loop, and be ruthless about separating what you proved from what you assumed. The rest is engineering.