Mid-market companies — 50 to 500 employees, $10M to $200M in revenue — face a specific problem with AI adoption. They are big enough that ad-hoc experimentation wastes real money, but small enough that the "digital transformation office with a 12-month roadmap" approach used by Fortune 500s is wildly disproportionate. What they need is a small, bounded, time-limited way to find out whether a specific AI project actually works, before committing to anything larger.
The framework in this article is the one we run for clients at N40. It takes four weeks, costs a fraction of a full build, and produces either a working production agent or a clear "no" with defensible reasons. Either outcome is useful. The important thing is that by the end of week four, you know something you did not know at the start, and the knowledge cost is bounded.
Week 1: Problem selection
The single biggest determinant of pilot success is what problem you pick. Pick well, and the rest of the framework is mostly execution. Pick badly, and no amount of good engineering saves it. We screen candidate problems against five traits:
- Measurable: The current process has a number attached to it — time per task, error rate, cost per outcome, customer satisfaction score. If you cannot measure the baseline, you cannot measure improvement, and the pilot cannot succeed in any clear way.
- Repetitive: The task happens at least dozens of times per week, ideally hundreds. Rare tasks are bad pilot candidates even if they are painful, because you cannot gather enough data to tell if the agent is working.
- Bounded: The scope of the task is clearly defined. "Answer customer questions about shipping" is bounded. "Help customers with anything" is not. Bounded tasks are dramatically easier to build, test, and evaluate.
- Low stakes per instance: A single failure does not cause significant damage. This is not about avoiding accountability — it is about letting you learn. Agents that make high-stakes decisions need much more testing before going live.
- Data available: You have historical examples of the task being done correctly. Real examples, not hypotheticals. The agent will be built and evaluated against this data.
The output of week 1 is a one-page problem definition: what the task is, what the baseline metrics are, what success looks like, and 50+ real examples of the task being performed. If you cannot produce this document, the pilot should not start — you do not have enough clarity yet.
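The five traits above are concrete enough to encode as a checklist. Here is a minimal sketch of that screen in Python — the field names, thresholds, and the `PilotCandidate` class are our own illustrative choices, not a prescribed format; the point is that every trait becomes a yes/no gate you can apply to each candidate problem.

```python
from dataclasses import dataclass

@dataclass
class PilotCandidate:
    task: str
    baseline_metric: str   # e.g. "9 min average handle time" — measurable
    weekly_volume: int     # how often the task occurs — repetitive
    bounded: bool          # scope is clearly defined
    low_stakes: bool       # a single failure is recoverable
    example_count: int     # real historical examples on hand — data available

    def ready_for_pilot(self) -> bool:
        # All five traits must hold; any single miss disqualifies the candidate.
        return (
            bool(self.baseline_metric)
            and self.weekly_volume >= 50   # "dozens per week, ideally hundreds"
            and self.bounded
            and self.low_stakes
            and self.example_count >= 50   # the week-1 example requirement
        )

candidate = PilotCandidate(
    task="Answer customer questions about shipping",
    baseline_metric="9 min average handle time",
    weekly_volume=300,
    bounded=True,
    low_stakes=True,
    example_count=120,
)
print(candidate.ready_for_pilot())  # True
```

If a candidate fails the screen, the useful output is *which* gate it failed — that tells you what clarity is missing before a pilot can start.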
Week 2: Build the minimum viable agent
Week 2 is where the engineering happens. The rule is minimum viable: one integration, one model, one prompt, one evaluation set. Resist the urge to add features. Every feature added in week 2 is a feature you have to test and debug in weeks 3 and 4, and the whole timeline slips.
A typical week-2 build looks like this: the agent uses a single API (usually Claude or GPT), connects to one data source (the CRM, the ticket system, the knowledge base — whatever is closest to the task), handles one channel (web chat or email, not both), and has a prompt written against the 50 examples from week 1. We stand up an evaluation harness — usually a spreadsheet with test cases and expected outcomes — and iterate on the prompt until the agent handles at least 80% of the test cases correctly.
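The evaluation harness really can be this simple. The sketch below assumes a CSV with `input` and `expected` columns — one test case per row, drawn from the week-1 examples — and stubs out the model call, since the actual `run_agent` would wrap whatever API you chose:

```python
import csv

def run_agent(prompt: str, task_input: str) -> str:
    # Placeholder: in a real harness this calls your model with the
    # current prompt and returns its answer for one task input.
    return "stubbed answer"

def load_cases(path):
    # Expects columns "input" and "expected", one test case per row.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def pass_rate(prompt, cases):
    passed = sum(
        1 for case in cases
        if run_agent(prompt, case["input"]).strip().lower()
        == case["expected"].strip().lower()
    )
    return passed / len(cases)

# Iterate on the prompt by hand; move to shadow mode only at >= 0.80.
# cases = load_cases("week1_examples.csv")
# print(pass_rate(PROMPT, cases))
```

Exact string match is the crudest possible scoring; in practice you often need fuzzier checks (key facts present, correct category chosen). Start crude anyway — a harsh metric in week 2 is better than a forgiving one.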
At the end of week 2, you have an agent that works on a test set. You do not have an agent that works in production yet. That is week 3.
Week 3: Shadow mode
Shadow mode is the part of the framework that most teams skip, and it is the most important part. The agent runs in parallel with the existing human process. Every time a human handles a task, the agent also handles it — independently, without its output going anywhere. You compare the outputs after the fact.
Shadow mode gives you two things no test set can. First, real distribution: the actual mix of tasks that come in, including the weird edge cases that nobody thinks to put in a test set. Second, honest accuracy: you see how the agent does on live inputs, not the ones you cherry-picked in week 2. We have watched agents that scored 90% on the test set drop to 65% on shadow traffic. That is the point of shadow mode — to catch that gap before it affects customers.
The metrics we track in shadow mode: agreement rate with human (does the agent produce the same answer the human did?), escalation rate (how often does the agent correctly decide it does not know?), and a qualitative review of disagreements (are the agent's disagreements actually worse than the human's answers, or sometimes better?). That last one is important — we have found cases where the agent was more consistent than the human baseline, which is useful to know.
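Computing these metrics is a few lines once you log paired outputs. The record shape below (an `ESCALATE` sentinel for the agent declining to answer) is an assumption to make the sketch self-contained; adapt it to however your shadow logging actually stores human and agent answers:

```python
def shadow_metrics(records):
    # Each record pairs the human's answer with the agent's answer
    # for the same live task.
    total = len(records)
    agree = sum(1 for r in records if r["agent"] == r["human"])
    escalated = sum(1 for r in records if r["agent"] == "ESCALATE")
    disagreements = [
        r for r in records
        if r["agent"] != r["human"] and r["agent"] != "ESCALATE"
    ]
    return {
        "agreement_rate": agree / total,
        "escalation_rate": escalated / total,
        "disagreements": disagreements,  # queue these for qualitative review
    }

records = [
    {"human": "refund", "agent": "refund"},
    {"human": "refund", "agent": "replace"},   # disagreement -> review
    {"human": "refund", "agent": "ESCALATE"},  # agent declined
    {"human": "track",  "agent": "track"},
]
print(shadow_metrics(records)["agreement_rate"])  # 0.5
```

Note that the disagreement list is the most valuable output: agreement rate tells you *how often* the agent diverges, but only reading the disagreements tells you whether the divergence is worse than the human baseline or, sometimes, better.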
By the end of week 3, you have real numbers on real traffic. You know whether the agent is ready for production or needs another iteration.
Week 4: Gated rollout with a kill switch
If shadow mode results are strong, week 4 is production. But production does not mean 100% of traffic on day one. It means a gated rollout: 10% of traffic goes to the agent, 90% to humans, for the first 2–3 days. If metrics hold, ramp to 50% for another 2–3 days. If metrics still hold, ramp to 100%.
At every ramp stage, there is a kill switch — a one-click way for a human operator to route all traffic back to the human process. This is non-negotiable. Any production deployment without a kill switch is irresponsible, regardless of how confident you are in the agent. The kill switch is what lets you ship boldly — because if something goes wrong, the rollback is instant.
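A gated router with a kill switch is structurally trivial, which is part of the argument for never skipping it. In the sketch below the ramp share and kill flag are plain in-process state for illustration; in production they would live in a shared config store or feature-flag service so an operator can flip them without a deploy:

```python
import random

class GatedRouter:
    def __init__(self, agent_share: float = 0.10):
        self.agent_share = agent_share  # ramp: 0.10 -> 0.50 -> 1.0
        self.killed = False

    def kill(self):
        # The one-click rollback: everything goes back to humans instantly.
        self.killed = True

    def route(self, rng=random.random) -> str:
        if self.killed:
            return "human"
        return "agent" if rng() < self.agent_share else "human"

router = GatedRouter(agent_share=0.10)
router.kill()
print(router.route())  # "human", regardless of the draw
```

The design choice worth copying is that the kill switch is checked *before* the ramp percentage — a killed router routes nothing to the agent even at 100%, so rollback never depends on anyone remembering to edit the ramp.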
The metrics to watch during rollout: agent accuracy (did it drop compared to shadow mode?), escalation rate (is it appropriate?), latency (is the agent responding fast enough?), and cost per task (do the economics still work?). Any of these going sideways is a signal to pause or roll back.
Success metrics and common failure modes
What does success look like? For a 4-week pilot, we consider it successful if the agent handles at least 60% of its scoped task autonomously, with accuracy at least matching the human baseline, at a cost that makes economic sense. Below 60%, the agent adds more work (reviewing its output) than it removes. Above 60%, it starts producing real leverage.
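The break-even logic can be made concrete with a little arithmetic. The numbers below are invented for illustration — a 9-minute task, 3 minutes to spot-check an agent answer, 9 minutes of extra triage when the agent fails and a human has to redo the work — and they happen to put the break-even at exactly 60% autonomy; your numbers will differ, but the shape of the calculation is the point:

```python
def minutes_saved(volume, autonomy_pct, task_min=9, review_min=3, triage_min=9):
    # Human-only baseline: every task done by hand.
    human_only = volume * task_min
    # With the agent: spot-check what it handles autonomously,
    # redo (plus clean up) everything it does not.
    with_agent = volume * (
        autonomy_pct * review_min
        + (100 - autonomy_pct) * (task_min + triage_min)
    ) // 100
    return human_only - with_agent

print(minutes_saved(1000, 60))  # 0 -> break-even
print(minutes_saved(1000, 75))  # 2250 minutes saved
print(minutes_saved(1000, 40))  # -3000: the agent is net-negative
```

Whatever your real inputs are, running this calculation before the pilot forces the team to agree on what "economic sense" means in minutes, not vibes.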
The common failure modes, in rough order of frequency:
- Scope creep in week 2. The team adds features that were not in the week-1 definition. By week 3 they are debugging features that are not on the critical path.
- Skipping shadow mode. Going straight from test set to production. The gap between test accuracy and real accuracy bites in production instead of in shadow.
- No kill switch. Deploying to 100% because "it looked fine in shadow." It almost always looks fine in shadow. Gated rollout is cheap insurance.
- Picking a problem that does not meet the five traits. Usually "measurable" is the one that fails — the team cannot actually define what success looks like, so every result is debatable.
None of these are exotic. They are avoided by discipline more than by cleverness. The framework exists mostly to make the discipline explicit.
We run pilots like this for mid-market clients regularly — the structure is mature, the timeline is realistic, and the worst case is you learn something useful for the cost of a single month of work. If you have a specific problem in mind and want to know whether it would make a good pilot, see our case studies or start a conversation at /contact.
