What a Simulated Cohort Taught Us — Claude Corps

01 — The approach

Test the design, not our optimism

Claude Corps places early-career builders into nonprofit host organizations for a year, supported by a CodePath mentor, the host's own manager, and Anthropic technical staff. The model rests on a few load-bearing bets: a 1:24 mentor-to-fellow ratio, a pod structure that groups each fellow with a host and a mentor, and a training design that front-loads common skills and then delivers the rest just in time.

Those bets are reasonable. They are also untested. The honest risk with any program design is that it looks sound on a slide and fails in week three, when a real person is the one paying for the gap. We wanted to find the gaps first.

So we asked a narrow, answerable question: if we run one cohort's first month under realistic conditions, where does this design hold, and where does it strain? Not "is it good" — "where does it break, and what would we change before it does."

499 synthetic profiles: hosts, fellows, mentors, and support staff

48 pods formed through the real selection and matching rubrics

6 weekly ticks simulated (T0 through T5), from pre-program through the Day-30 milestone

02 — The research

Built on the outside world, not our own assumptions

A simulation is only as honest as what it's grounded in. If we had invented the host organizations and fellows from our own heads, we'd have learned only what we already believed. So the population came from research first.

We pulled from comparable programs and the literature that describes the conditions Claude Corps will actually operate in: AI residencies and fellowships (the Anthropic Fellows program, US Digital Service, Coding it Forward, Code for America), nonprofit-sector technology adoption studies, and management research on the organizations that will host fellows — including findings on nonprofit leadership gaps, manager tenure, and the roughly one-in-five annual turnover common in the sector.

That research became schemas — controlled vocabularies for the traits that matter: a host manager's supervision capacity and bench depth, a fellow's shipping habits and AI fluency, a mentor's coaching style and ceiling. Every profile was generated against those schemas with deliberate diversity, so the cohort looked like a plausible real one — including its uncomfortable parts. Some host managers are enthusiastic but absent. Some fellows ship fast and skip review. Some are placed a level beyond what their mentor can coach. We built those in on purpose, because those are the conditions that test a design.

Why this matters

The failures you'll read about later aren't accidents we stumbled into. They're the predictable consequences of conditions we know exist in the real world — absent managers, over-stretched fellows, quiet under-reporters — playing out against the program's structure. The simulation's job was to show us how the design responds to them.

03 — What we built

A cohort that runs a week at a time

From the research population, we ran the program's actual selection and matching rubrics: 300 applicants narrowed to 100 fellows, 100 candidate hosts to about 50, 50 mentor applicants to 5. Matching formed 48 pods. Then we pressed play.

Each tick is one week. Every fellow carries a compact state record (are they on track, what's their mentor load, what's stuck and why), and a smaller sample keeps first-person reflection journals, so the run has both numbers and a human voice. The month runs from T0 (remote pre-program) through basecamp, deployment, and the working weeks, to T5: the just-in-time training week and the Day-30 milestone.
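As a rough illustration of the per-fellow state record described above, here is a minimal Python sketch of one weekly tick. The class and field names (FellowState, mentor_load_pct, blocker, blocker_cause, journal) are assumptions made for this example, not the simulation's actual schema. Only the three status values and the T0 to T5 range come from this write-up.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import List, Optional


class Status(Enum):
    # The three system states named in this write-up.
    ON_TRACK = "on_track"
    AT_RISK = "at_risk"
    STALLED = "stalled"


LAST_TICK = 5  # the run covers T0 through T5


@dataclass
class FellowState:
    """One fellow's record at one weekly tick (field names are hypothetical)."""
    fellow_id: str
    tick: int                              # 0..5, i.e. T0 through T5
    status: Status = Status.ON_TRACK
    mentor_load_pct: float = 0.0           # the assigned mentor's utilization that week
    blocker: Optional[str] = None          # what is stuck, if anything
    blocker_cause: Optional[str] = None    # why it is stuck
    journal: List[str] = field(default_factory=list)  # only a sample of fellows keep one


def advance(state: FellowState) -> FellowState:
    """Copy a record forward one week without mutating the previous tick."""
    if state.tick >= LAST_TICK:
        raise ValueError("the run ends at T5")
    return replace(state, tick=state.tick + 1, journal=list(state.journal))
```

Keeping each tick as a separate copy, rather than mutating one record, is one way to preserve the week-by-week history that the charts on this page draw from.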

The support layer that acts

Halfway through, we turned on the part that makes this more than a spreadsheet: a feedback layer. The people who support a fellow — Anthropic's applied-AI engineers at the pod level, customer-success managers at the cohort level, and the program's Senior Director — observe what surfaces each week and intervene. Their interventions then change the next week. Crucially, we built in three honest constraints:

First, help takes a week to land. Second, its effect depends on whether the lever matches the cause: you cannot coach away a manager who isn't there. Third, the support staff can only act on what reaches them through a channel: a fellow raising their hand, an open escalation, or a mentor noticing in a 1:1. A fellow who quietly reports they're fine, with a mentor too stretched to read between the lines, stays invisible.

That last constraint came directly from a simple observation: nobody monitors 100 people's true condition every week. Real support runs on what surfaces. Building that in is what let the simulation show us its sharpest finding.
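The three constraints above can be sketched as a single planning rule. This is a minimal sketch under stated assumptions, not the simulation's code: the Fellow fields, the lever names, and the LEVERS_FOR_CAUSE mapping are all hypothetical and chosen only to make the logic concrete.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a problem's cause to the levers that can address it.
LEVERS_FOR_CAUSE = {
    "skill_gap": {"training", "review_partner"},
    "absent_manager": {"host_escalation"},
}


@dataclass
class Fellow:
    name: str
    cause: Optional[str]            # None means nothing is wrong
    raised_hand: bool = False       # channel 1: the fellow asks for help
    open_escalation: bool = False   # channel 2: an escalation is already open
    mentor_noticed: bool = False    # channel 3: the mentor caught it in a 1:1


def is_visible(fellow: Fellow) -> bool:
    """Support staff only see a fellow through one of the three channels."""
    return fellow.raised_hand or fellow.open_escalation or fellow.mentor_noticed


def plan_intervention(fellow: Fellow, lever: str, week: int) -> Optional[dict]:
    """Return a pending intervention, or None if there is nothing to act on."""
    if fellow.cause is None or not is_visible(fellow):
        return None  # a quiet fellow stays invisible, whatever their true state
    return {
        "fellow": fellow.name,
        "lever": lever,
        "lands_in_week": week + 1,  # help takes a week to land
        "effective": lever in LEVERS_FOR_CAUSE.get(fellow.cause, set()),
    }
```

In this sketch a fellow with a real problem and no active channel yields no intervention at all, and a mismatched lever (training applied to an absent manager) is scheduled but marked ineffective.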

One status label, named honestly

The model marks each fellow on_track, at_risk, or stalled; the charts and text below show at_risk as "flagged." Those are system states, not labels for people, and for any tool real fellows would see, we'd choose more supportive language. We keep the terms here only because they're the model's vocabulary and precision helps.

04 — The outcomes

What the month showed

Four questions drove the run. The answers were consistent, and a few were uncomfortable in a useful way.

Interactive timeline, T0 through T5 (legend: on track, flagged / at risk, stalled). At T0, pre-program remote onboarding: mentor load 50%, flagged fellows 0, fellows reporting fine but flagged 0.

Every figure on this page is drawn from the run's actual week-by-week data. The cohort reads calm at first and strains as real work begins.

RQ1 — The 1:24 ratio doesn't fit, with or without a crisis

The clearest finding needs no interpretation. Mentor load climbed every week of real work and never came back under capacity. By week two every mentor was over 100%, with zero slack in the system. Cohort-wide load still sat at 128% at month-end, with 93 of 100 fellows healthy.

That last part is the point. The load stays high even when almost nobody is in crisis, because the baseline routine — roughly half an hour of one-to-one time per fellow per week — already exceeds a mentor's available hours at 24 fellows. The ratio is over-subscribed at the floor. No amount of good support technique closes a gap that exists before the first problem appears.
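The floor arithmetic can be checked in a few lines. The half hour per fellow and the 24 fellows come from the paragraph above. The 10-hour weekly capacity is a placeholder assumption chosen only to make the example concrete, since this write-up does not state a mentor's available hours.

```python
def baseline_hours(fellows: int, hours_per_fellow: float) -> float:
    """Routine one-to-one time a mentor owes each week, before any problem appears."""
    return fellows * hours_per_fellow


def utilization(fellows: int, hours_per_fellow: float, capacity_hours: float) -> float:
    """Baseline load as a fraction of the mentor's available weekly hours."""
    if capacity_hours <= 0:
        raise ValueError("capacity_hours must be positive")
    return baseline_hours(fellows, hours_per_fellow) / capacity_hours


def max_fellows(hours_per_fellow: float, capacity_hours: float) -> int:
    """Largest caseload whose routine time alone still fits within capacity."""
    return int(capacity_hours // hours_per_fellow)


ASSUMED_CAPACITY_HOURS = 10.0  # hypothetical; not a figure from the run

floor_hours = baseline_hours(24, 0.5)
floor_load = utilization(24, 0.5, ASSUMED_CAPACITY_HOURS)
print(f"routine hours at 24 fellows: {floor_hours}")
print(f"utilization at the assumed capacity: {floor_load:.0%}")
```

At 24 fellows and half an hour each, routine one-to-one time alone is 12 hours a week, so any mentor with fewer than 12 available hours starts above 100% before the first problem appears. The 128% month-end figure reported above is the run's own cohort-wide number and is not derived from this sketch.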

Mentor load across the month

Cohort-wide utilization vs. a mentor's weekly capacity (100%)

Capacity is the flat line at 100%. Load crosses it between week 1 and week 2 and never returns. The dip at the end is real work resolving — not the ratio getting easier.

RQ3 — A month at that load is not renewable

We asked, at month-end, whether each mentor would sign up for a second cohort. The answers track the load almost exactly. The mentor carrying 27 fellows — the program's most capable — is the clearest flight risk. The mentor whose pod drew an absent host manager would only return if host selection improves, because her best skill, catching trouble early, bought information she couldn't act on.

Mentor sustained load and renewal signal

Each mentor's month-end utilization, colored by whether they'd re-up

Legend: at risk of leaving, conditional, likely (with a cap)

Nobody finished the month under capacity. Renewal intent ranges from "only with a hard headcount cap" to "likely, if pods stay near 15–20." Sustainability at 1:24 is poor without a structural change.

The sharpest finding: we catch the quiet ones too late

Two fellows in the run reported they were fine for two straight weeks while quietly falling behind. Their mentors were too loaded to read them, and nothing in the support chain catches a person who doesn't raise a hand. They surfaced only when their work broke — and by then they had slipped from flagged to stalled.

When help finally arrived, it worked, but only partway. They recovered to flagged, not to on track, by the Day-30 milestone. Compare that to a fellow caught early in the same kind of trouble, who recovered completely. Same intervention. Different outcome. The only variable was the two-week delay in detection.

The cost of late detection

Two trajectories through the same kind of trouble, caught at different times

Legend: caught early (week 2), caught late (week 4)

Detection, not the intervention, was the bottleneck. The support levers recover fellows reliably. What fails is seeing the quiet ones in time.

RQ2 — Training fixes skills; it does nothing for structure

The just-in-time training week did exactly what it was designed to do: it closed every skill gap it touched. The fellows who shipped fast and skipped review — once paired with a human reviewer and given the training — recovered cleanly.

But seven fellows ended the month still flagged, and not one of them had a skill problem. Their issues were structural — the kind no curriculum reaches.

The seven fellows still flagged at Day 30, by cause

None is a training problem. Each needs a staffing or selection fix.

Skill failures got fixed. Structural failures didn't. Absent managers, a fellow placed beyond her mentor's reach, and cases the support layer never had capacity to serve — training was never going to touch any of these.

RQ4 — More cohort staff wouldn't have helped the crises

We ran the whole month twice — once with 8 cohort-level support managers, once with 5 — to see whether the leaner staffing degraded outcomes. It made no difference to the crises at all. The two runs were identical, week after week.

The reason is instructive. Crises route to one of two places: a single dedicated "high-touch" support track (the same one person in both versions) or the pod-level technical staff. Neither changes with the cohort-manager count. The extra managers in the larger roster carry routine work — content, community, check-ins — not emergencies. So trimming them would thin routine support, not crisis response. The real bottleneck was the single high-touch track, which was overwhelmed in both versions.

8 = 5: crisis outcomes were identical under both staffing levels, every week

A single high-touch crisis track — the actual bottleneck — in both versions
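The reason the two staffing runs matched can be shown with a toy routing function. This is a sketch under stated assumptions: the crisis kinds ("technical", "host", "wellbeing") and the rule that splits them between the two destinations are hypothetical, since this write-up says only that crises route to the high-touch track or to pod-level technical staff.

```python
from collections import Counter
from typing import List


def route_crisis(kind: str) -> str:
    """Send a crisis to one of the two destinations; cohort managers are not one of them."""
    return "pod_technical_staff" if kind == "technical" else "high_touch_track"


def crisis_load(crises: List[str], cohort_managers: int) -> Counter:
    """Count crises per destination for one week.

    cohort_managers is accepted but deliberately unused: in the design
    described above, crisis routing does not depend on it.
    """
    return Counter(route_crisis(kind) for kind in crises)


# Hypothetical week of crises, run under both staffing levels.
week = ["host", "technical", "wellbeing", "host"]
with_eight = crisis_load(week, cohort_managers=8)
with_five = crisis_load(week, cohort_managers=5)
print(with_eight == with_five)
```

Because the manager count never enters the routing, the two runs cannot differ on crises. The only way to change the crisis picture in this sketch is to change what sits behind "high_touch_track".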

05 — What we'd change

Five fixes, before the real cohort

Each of these comes straight out of a failure the run made visible. Together they target the structure, where the leverage actually is.

Cap each mentor's caseload near 16–20 fellows, or add a sixth mentor. The 1:24 ratio is over capacity before the first problem appears. This is the root cause behind the load curve and the renewal risk.
Add a second high-touch crisis support manager. One person covering every hard case across the cohort is a single point of failure. That single track deferred live cases two weeks running, in both staffing versions.
Stand up a proactive scan for quiet, under-reporting fellows from week one. The scan should read communication style and rapport rather than wait for a deliverable to break. This is what turns a two-week stall into an early catch.
Assign review partners to flagged "ships-without-review" fellows up front. Every fellow who hit this pattern recovered once paired with a reviewer. We know who they are at intake — pair them then, instead of after each one breaks.
Harden host-manager selection. No solo-managed, thin-bench placements. The cases that never recovered were host-selection misses visible at intake.

06 — What this is, and what it isn't

The honest boundaries

A simulation earns trust by being clear about its limits, so here they are plainly.

The people aren't real, and neither are the outcomes. No fellow recovered or stalled; a model did. The value is in the dynamics it exposes — where load concentrates, how detection lags, which fixes are structural — not in any specific count.

We chose the conditions. We set how often absent managers, over-stretched fellows, and quiet under-reporters appear, grounded in research but calibrated by us. Change those rates and the numbers move. The structural findings are robust to that; the exact figures are illustrative.

It's deterministic and bounded. The run covers one cohort's first month, from pre-program through Day 30. It doesn't model the second basecamp, actual renewal decisions (RQ3 captures only stated intent at month-end), or longer-run attrition. Those would be a separate study.

It complements real piloting; it doesn't replace it. The point was to find where to spend attention before a real fellow's year is on the line. The Colorado pilot and Cohort 1 itself remain the real tests. The simulation just means we walk in already knowing the four places to watch.

The one line to carry out of the room

We built this to ask whether the 1:24 model works. In one respect it does: the interventions recover fellows reliably. What it can't do is outrun its own structure. Fix the ratio, the crisis-coverage bottleneck, the detection lag, and host selection, and the model is strong. Leave them, and the same handful of fellows will struggle in every cohort, no matter how good the people supporting them are.

We ran a month of Claude Corps before a single fellow arrived.
