To pressure-test the program's design, we built a simulated cohort — 100 host organizations, 300 applicants, 50 mentors — and lived its first month, week by week. Nobody in it is real. What it surfaced about the design is.
Everything here comes from a simulation. The host organizations, fellows, and mentors are synthetic profiles we generated and grounded in research — not real people, and not real outcomes. The numbers describe how the model behaved, not what will happen to a real cohort.
That is the point. A simulation lets us stress the design and find where it breaks before we spend a real fellow's year finding out. Read the findings as "here is where the structure strains," not "here is what will happen." We say what the model can and can't tell us in the final section.
Claude Corps places early-career builders into nonprofit host organizations for a year, supported by a CodePath mentor, the host's own manager, and Anthropic technical staff. The model rests on a few load-bearing bets: a 1:24 mentor-to-fellow ratio, a pod structure that groups each fellow with a host and a mentor, and a training design that front-loads common skills and then delivers the rest just in time.
Those bets are reasonable. They are also untested. The honest risk with any program design is that it looks sound on a slide and fails in week three, when a real person is the one paying for the gap. We wanted to find the gaps first.
So we asked a narrow, answerable question: if we run one cohort's first month under realistic conditions, where does this design hold, and where does it strain? Not "is it good" — "where does it break, and what would we change before it does."
A simulation is only as honest as what it's grounded in. If we had invented the host organizations and fellows from our own heads, we'd have learned only what we already believed. So the population came from research first.
We pulled from comparable programs and the literature that describes the conditions Claude Corps will actually operate in: AI residencies and fellowships (the Anthropic Fellows program, US Digital Service, Coding it Forward, Code for America), nonprofit-sector technology adoption studies, and management research on the organizations that will host fellows — including findings on nonprofit leadership gaps, manager tenure, and the roughly one-in-five annual turnover common in the sector.
That research became schemas — controlled vocabularies for the traits that matter: a host manager's supervision capacity and bench depth, a fellow's shipping habits and AI fluency, a mentor's coaching style and ceiling. Every profile was generated against those schemas with deliberate diversity, so the cohort looked like a plausible real one — including its uncomfortable parts. Some host managers are enthusiastic but absent. Some fellows ship fast and skip review. Some are placed a level beyond what their mentor can coach. We built those in on purpose, because those are the conditions that test a design.
The failures you'll read about later aren't accidents we stumbled into. They're the predictable consequences of conditions we know exist in the real world — absent managers, over-stretched fellows, quiet under-reporters — playing out against the program's structure. The simulation's job was to show us how the design responds to them.
From the research population, we ran the program's actual selection and matching rubrics: 300 applicants narrowed to 100 fellows, 100 candidate hosts to about 50, 50 mentor applicants to 5. Matching formed 48 pods. Then we pressed play.
Each tick is one week. Every fellow carries a compact state record (are they on track, what's their mentor load, what's stuck and why), and a smaller sample keeps first-person reflection journals, so the run has both numbers and a human voice. The month runs from T0 (remote pre-program) through basecamp, deployment, and the working weeks, to T5: the just-in-time training week and the Day-30 milestone.
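As a rough illustration of what such a per-fellow record could look like, here is a minimal Python sketch. The three status values come from the model's vocabulary; every field name, the `FellowState` class, and the `advance` helper are hypothetical, chosen only to mirror the description above, and are not the simulation's actual schema.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Status(Enum):
    ON_TRACK = "on_track"
    AT_RISK = "at_risk"
    STALLED = "stalled"


@dataclass(frozen=True)
class FellowState:
    """One fellow's compact record for one weekly tick (hypothetical fields)."""
    fellow_id: str
    tick: int                            # 0..5, i.e. T0 through T5
    status: Status = Status.ON_TRACK     # are they on track?
    mentor_load_pct: float = 0.0         # their mentor's utilization; 100.0 = at capacity
    blocker: Optional[str] = None        # what is stuck
    blocker_cause: Optional[str] = None  # why it is stuck
    journal: Optional[str] = None        # reflection text, only for the sampled subset


def advance(state: FellowState, **changes) -> FellowState:
    """Return next week's record: tick + 1, plus any fields that changed."""
    return replace(state, tick=state.tick + 1, **changes)


week0 = FellowState(fellow_id="f-001", tick=0)
week1 = advance(week0, status=Status.AT_RISK, blocker="review skipped")
```

Keeping each week's record immutable means the full T0 to T5 history of a fellow is just a list of these records, which makes trajectories easy to compare after the run.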
Halfway through, we turned on the part that makes this more than a spreadsheet: a feedback layer. The people who support a fellow — Anthropic's applied-AI engineers at the pod level, customer-success managers at the cohort level, and the program's Senior Director — observe what surfaces each week and intervene. Their interventions then change the next week. Crucially, we built in three honest constraints:
Help takes a week to land. Its effect depends on whether the lever matches the cause — you cannot coach away a manager who isn't there. And the support staff can only act on what reaches them through a channel: a fellow raising their hand, an open escalation, or a mentor noticing in a 1:1. A fellow who quietly reports they're fine, with a mentor too stretched to read between the lines, stays invisible.
That last constraint came directly from a simple observation: nobody monitors 100 people's true condition every week. Real support runs on what surfaces. Building that in is what let the simulation show us its sharpest finding.
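The three constraints can be sketched as a single gate on every intervention. This is a simplified illustration, not the simulation's code: the cause names, lever names, and the `LEVER_FIXES` table are all hypothetical, and the sketch treats lever-to-cause matching as all-or-nothing, whereas the run only says the effect depends on the match.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: which kind of help can address which cause.
LEVER_FIXES = {
    "coaching": {"skill_gap", "skipped_review"},
    "host_escalation": {"absent_manager"},
    "rescope": {"over_stretched"},
}

# The three channels through which a problem can reach support staff.
CHANNELS = {"self_report", "open_escalation", "mentor_1on1"}


@dataclass
class Problem:
    cause: str
    surfaced_via: Optional[str] = None  # None: nobody has seen it yet


def intervene(problem: Problem, lever: str, week: int) -> Optional[int]:
    """Return the week in which help takes effect, or None if it cannot."""
    if problem.surfaced_via not in CHANNELS:
        return None  # staff can only act on what reaches them
    if problem.cause not in LEVER_FIXES.get(lever, set()):
        return None  # the lever has to match the cause
    return week + 1  # help takes a week to land


lands = intervene(Problem("skill_gap", "self_report"), "coaching", week=2)
mismatch = intervene(Problem("absent_manager", "mentor_1on1"), "coaching", week=2)
unseen = intervene(Problem("skill_gap"), "coaching", week=2)
```

Here `mismatch` is the "you cannot coach away a manager who isn't there" case, and `unseen` is the fellow who never surfaces: the right lever exists, but nothing triggers it.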
The model marks each fellow on_track, at_risk, or stalled. Those are system states, not labels for people, and for any tool real fellows would see, we'd choose more supportive language. We keep the terms here only because they're the model's vocabulary and precision helps. In the findings, "flagged" corresponds to at_risk, and "healthy" or "fine" to on_track.
Four questions drove the run. The answers were consistent, and a few were uncomfortable in a useful way.
The clearest finding needs no interpretation. Mentor load climbed every week of real work and never came back under capacity. By week two every mentor was over 100%, with zero slack in the system. It still sat at 128% at month-end — with 93 of 100 fellows healthy.
That last part is the point. The load stays high even when almost nobody is in crisis, because the baseline routine — roughly half an hour of one-to-one time per fellow per week — already exceeds a mentor's available hours at 24 fellows. The ratio is over-subscribed at the floor. No amount of good support technique closes a gap that exists before the first problem appears.
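The floor arithmetic is simple enough to write down. At half an hour per fellow, 24 fellows means 12 hours of routine one-to-ones a week before any problem appears. The sketch below assumes a mentor capacity of 10 hours a week purely as a placeholder: the write-up does not state the real figure, only that it is less than what the baseline demands at 24 fellows. The resulting percentages are therefore illustrative and should not be compared with the run's reported utilization.

```python
def baseline_utilization(fellows: int,
                         minutes_per_fellow: int = 30,
                         capacity_hours: int = 10) -> float:
    """Percent of weekly capacity consumed by routine 1:1s alone.

    capacity_hours is a placeholder assumption, not a program figure.
    """
    baseline_minutes = fellows * minutes_per_fellow
    capacity_minutes = capacity_hours * 60
    return 100 * baseline_minutes / capacity_minutes


def max_fellows_within_capacity(minutes_per_fellow: int = 30,
                                capacity_hours: int = 10) -> int:
    """Largest caseload whose routine 1:1s still fit inside capacity."""
    return (capacity_hours * 60) // minutes_per_fellow


print(baseline_utilization(24))       # 120.0 under the placeholder capacity
print(baseline_utilization(27))       # 135.0 for the most loaded mentor's caseload
print(max_fellows_within_capacity())  # 20 under the placeholder capacity
```

Whatever the true capacity is, the shape of the result is the same: utilization scales linearly with caseload, so only a smaller caseload or more mentor hours moves the floor.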
Figure: Mentor load across the month. Cohort-wide utilization vs. a mentor's weekly capacity (100%).
We asked, at month-end, whether each mentor would sign up for a second cohort. The answers track the load almost exactly. The mentor carrying 27 fellows — the program's most capable — is the clearest flight risk. The mentor whose pod drew an absent host manager would only return if host selection improves, because her best skill, catching trouble early, bought information she couldn't act on.
Figure: Mentor sustained load and renewal signal. Each mentor's month-end utilization, colored by whether they'd re-up.
Two fellows in the run reported they were fine for two straight weeks while quietly falling behind. Their mentors were too loaded to read them, and nothing in the support chain catches a person who doesn't raise a hand. They surfaced only when their work broke — and by then they had slipped from flagged to stalled.
When help finally arrived, it worked — but only partway. They recovered to "flagged," not to "fine," by the Day-30 milestone. Compare that to a fellow caught early in the same kind of trouble, who recovered completely. Same intervention. Different outcome. The only variable was two weeks.
Figure: The cost of late detection. Two trajectories through the same kind of trouble, caught at different times.
The just-in-time training week did exactly what it was designed to do: it closed every skill gap it touched. The fellows who shipped fast and skipped review — once paired with a human reviewer and given the training — recovered cleanly.
But seven fellows ended the month still flagged, and not one of them had a skill problem. Their issues were structural — the kind no curriculum reaches.
Figure: The seven fellows still flagged at Day 30, by cause. None is a training problem. Each needs a staffing or selection fix.
We ran the whole month twice — once with 8 cohort-level support managers, once with 5 — to see whether the leaner staffing degraded outcomes. It made no difference to the crises at all. The two runs were identical, week after week.
The reason is instructive. Crises route to one of two places: a single dedicated "high-touch" support track (the same one person in both versions) or the pod-level technical staff. Neither changes with the cohort-manager count. The extra managers in the larger roster carry routine work — content, community, check-ins — not emergencies. So trimming them would thin routine support, not crisis response. The real bottleneck was the single high-touch track, which was overwhelmed in both versions.
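The routing logic explains why the two runs matched. In the hedged sketch below, the issue kinds, destination names, and the sample week are all invented for illustration. What it shows is that the cohort-manager count never enters the crisis path, so changing it can only change routine load per manager.

```python
from collections import Counter


def route(issue_kind: str) -> str:
    """Send an issue to the track that handles it (hypothetical issue kinds)."""
    if issue_kind == "crisis":
        return "high_touch_track"     # one dedicated person in both versions
    if issue_kind == "technical_crisis":
        return "pod_technical_staff"  # pod-level technical staff
    return "cohort_managers"          # routine: content, community, check-ins


def weekly_load(issues, cohort_managers: int) -> dict:
    """Count work per track; only routine work is split across managers."""
    counts = Counter(route(kind) for kind in issues)
    return {
        "high_touch_track": counts["high_touch_track"],
        "pod_technical_staff": counts["pod_technical_staff"],
        "routine_per_manager": counts["cohort_managers"] / cohort_managers,
    }


# An invented week: 4 crises, 3 technical crises, 40 routine check-ins.
week = ["crisis"] * 4 + ["technical_crisis"] * 3 + ["check_in"] * 40
with_8 = weekly_load(week, cohort_managers=8)
with_5 = weekly_load(week, cohort_managers=5)
```

In this sketch the high-touch queue is 4 in both configurations, while routine work per manager rises from 5 to 8 check-ins when the roster shrinks.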
Every time the right person pulled the right lever on a problem they could see in time, the fellow recovered.
The interventions are sound. We don't need better tools. What the month exposed is that the model has four failure points that no intervention reaches: the 1:24 ratio that's over capacity at the floor, a single-threaded crisis track, late detection of fellows who don't raise their hand, and host-selection misses that put fellows under managers who were never there. Those are structural. You fix them with staffing and selection, not with effort.
Each fix we'd make comes straight out of a failure the run made visible: the ratio, crisis coverage, detection, and host selection. Together they target the structure, where the leverage actually is.
A simulation earns trust by being clear about its limits, so here they are plainly.
The people aren't real, and neither are the outcomes. No fellow recovered or stalled; a model did. The value is in the dynamics it exposes — where load concentrates, how detection lags, which fixes are structural — not in any specific count.
We chose the conditions. We set how often absent managers, over-stretched fellows, and quiet under-reporters appear, grounded in research but calibrated by us. Change those rates and the numbers move. The structural findings are robust to that; the exact figures are illustrative.
It's deterministic and bounded. The run covers one cohort's first month — pre-program through Day 30. It doesn't model the second basecamp, renewal decisions, or longer-run attrition. Those would be a separate study.
It complements real piloting; it doesn't replace it. The point was to find where to spend attention before a real fellow's year is on the line. The Colorado pilot and Cohort 1 itself remain the real tests. The simulation just means we walk in already knowing the four places to watch.
We built this to ask whether the program model works. Its interventions do: they recover fellows reliably. What the model can't do is outrun its own structure. Fix the ratio, the crisis-coverage bottleneck, the detection lag, and host selection, and the model is strong. Leave them, and the same handful of fellows will struggle in every cohort, no matter how good the people supporting them are.