The most common question we get from CTOs in the second month of an AI project is some version of "why did the API bill triple?" In the first month, the numbers look fine. By the third month, without intervention, they often do not. LLM costs have a specific pattern: they stay small while traffic is small, then scale non-linearly when usage grows, because the things that were inefficient at 100 requests per day become catastrophic at 10,000.
This article is about how to keep that from happening. The good news: for a well-designed agent handling 1,000–5,000 conversations per month, keeping the bill under $500 is not a stretch. Most of our clients operate in the $80–$400 range for that volume. The tactics below are the ones we apply on every project.
Why bills explode
Before tactics, the root causes. Almost every runaway LLM bill we have seen traces to four mistakes, usually stacked on top of each other:
- Wrong model for the job: Using GPT-4o or Claude Opus for tasks a small model would handle fine — classification, routing, short summaries. The price delta between a small and a large model is often 10–30x. If 80% of your calls do not need the big model, 80% of your bill is wasted.
- No prompt caching: Sending the same 2,000-token system prompt on every request. At production volumes, that is where the money goes. Prompt caching — explicit on Anthropic, automatic on OpenAI — cuts the cost of cached input tokens by 90% on Anthropic and by roughly half on OpenAI.
- Context stuffing: Passing the entire conversation history, the entire knowledge base, and every tool description on every turn. Context grows linearly with each turn, so the cumulative cost of a conversation grows quadratically. Most agents can work with a summarized history after 5–10 turns.
- No output limits: Letting the model generate as much as it wants. An agent that could answer in 50 tokens instead writes 500. Output tokens cost 3–5x more than input tokens on most models.
The pattern: each of these is a 2–3x cost multiplier. Stack two or three of them and you are paying 10–30x what a well-optimized agent would cost.
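The stacking is multiplicative, because each mistake inflates every request. A miniature version of the arithmetic, with illustrative per-mistake multipliers (assumptions, not measurements from any specific deployment):

```python
from math import prod

# Illustrative multipliers: roughly how much each mistake inflates the
# bill relative to a well-optimized agent (assumed values, not data).
multipliers = {
    "wrong model for routine calls": 3.0,
    "no prompt caching":             2.5,
    "context stuffing":              2.0,
}

# Each mistake scales every request, so the factors multiply.
overall = prod(multipliers.values())
print(f"stacked: {overall:.0f}x")  # three stacked mistakes -> 15x
```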
How to actually track spend
You cannot optimize what you cannot see. The default dashboards from OpenAI and Anthropic are useful but coarse — they show total spend, not which feature or customer is causing it. For any agent you plan to run in production, plug in one of these:
- Helicone: Proxy-based logging. One-line integration, gives you per-user, per-feature, per-prompt cost breakdowns. Free tier covers most small deployments. Our default for client projects.
- LangFuse: Self-hostable, more focused on trace debugging than pure cost analytics, but the cost data is there and the self-host option matters for EU data residency.
- OpenRouter: If you are already routing through OpenRouter for multi-model access, its usage dashboard gives you a single view across providers.
Whichever tool you pick, the non-negotiable is tagging every request with the feature name and a stable user ID. Without tags, when a bill spikes you cannot tell whether it was one runaway user, a bad deploy, or a traffic increase. With tags, the answer is one query away.
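As a sketch of what tagging looks like in practice, assuming Helicone's header-based custom properties (the header names below follow their convention; verify against the current Helicone docs before relying on them):

```python
# Build the per-request headers that let a cost dashboard slice spend
# by feature and by user. Values here are illustrative placeholders.
def tracking_headers(feature: str, user_id: str, helicone_key: str) -> dict:
    return {
        "Helicone-Auth": f"Bearer {helicone_key}",
        "Helicone-Property-Feature": feature,  # filterable property in the dashboard
        "Helicone-User-Id": user_id,           # stable per-customer ID, not a session ID
    }

headers = tracking_headers("support-chat", "user_1842", "sk-helicone-key")
```

With the OpenAI SDK, these would typically be passed as `default_headers` while pointing `base_url` at the Helicone proxy, so every call is tagged without touching individual call sites.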
The quick wins
In order of impact, these are the changes that consistently cut costs on real projects:
- Prompt caching on system messages: If your system prompt is over 1,024 tokens (Anthropic's minimum cacheable prompt length on most models), turn caching on. Cost of cached tokens drops to 10% of normal. For an agent with a 3,000-token system prompt and 10,000 daily calls, this alone saves roughly $150–250 per month. This is usually the single biggest win.
- Model routing: Use Haiku or GPT-4o-mini for classification, entity extraction, routing decisions, and simple Q&A. Reserve Opus or GPT-4o for the actual reasoning steps. A two-tier setup typically cuts costs by 60–70% with no noticeable quality drop.
- Output token limits: Set max_tokens aggressively. For a classification agent, 50 tokens is plenty. For a Q&A agent, 500. The default of "no limit" is how bills run away.
- Streaming with early stop: When the model is generating a long response and you can detect it is going off-rails or answering a different question, stop the stream. Saves the rest of the output tokens. Especially effective for tool-use agents.
- Conversation summarization: After 8–10 turns, summarize the history into a 200-token recap and replace the full log. Keeps context cost bounded.
- Semantic caching for FAQs: If 30% of your queries are variations of the same 20 questions, embed them and cache responses. A Redis cache with embedding similarity lookup can serve these at near-zero cost.
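Several of these wins compose naturally in a single request builder. A minimal sketch, assuming Anthropic-style prompt caching via `cache_control`; the task taxonomy and model IDs are illustrative placeholders, not real model names:

```python
# Tasks cheap enough for the small tier (assumed taxonomy for this sketch).
SMALL_TASKS = {"classify", "route", "extract"}

def build_request(task: str, system_prompt: str, user_msg: str) -> dict:
    small = task in SMALL_TASKS
    return {
        # Two-tier routing: substitute real model IDs for these placeholders.
        "model": "small-model-id" if small else "large-model-id",
        # Aggressive output caps per tier, per the list above.
        "max_tokens": 50 if small else 500,
        # Mark the static system prompt as cacheable (Anthropic-style).
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_request("classify", "You are a support ticket classifier.",
                    "Where is my refund?")
```

The point of centralizing this in one function is that the routing rule, the output caps, and the cache markers get applied on every call path, not just the ones someone remembered to optimize.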
Real numbers from real agents
To give you calibration, here is what typical N40 client deployments look like in terms of monthly spend:
- Customer support agent, 2,000 conversations/month, Claude Haiku + Sonnet routing, prompt caching on: $90–$140/month.
- Sales qualification agent, 5,000 conversations/month, GPT-4o-mini + GPT-4o routing, aggressive output limits: $180–$280/month.
- Internal knowledge agent with RAG, 500 queries/month, Claude Sonnet, semantic caching for common questions: $40–$80/month.
- Content generation agent, 1,000 long-form outputs/month, Claude Opus: $300–$450/month. This is the one that tends to be most expensive per call, because output tokens dominate.
The pattern across all of these is the same: costs scale sub-linearly with volume if the optimizations above are in place, and they scale linearly (or worse) if they are not. The difference at 10,000 conversations a month is the difference between a $400 bill and a $4,000 bill for the same functionality.
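A back-of-envelope estimator makes that gap visible. All prices (in dollars per million tokens) and token counts below are illustrative assumptions, not provider quotes; cached input is billed at 10% of the normal rate, per the Anthropic-style caching discussed above:

```python
def monthly_cost(convs, turns, in_tok, out_tok, in_price, out_price,
                 cached_frac=0.0):
    """Estimate monthly spend. Prices are $/1M tokens; cached input bills at 10%."""
    calls = convs * turns
    cached = in_tok * cached_frac
    fresh = in_tok - cached
    per_call = (fresh * in_price + cached * in_price * 0.10
                + out_tok * out_price) / 1e6
    return calls * per_call

# Unoptimized: big model, no caching, bloated context, long outputs.
naive = monthly_cost(10_000, 5, 6_000, 500, in_price=2.50, out_price=10.00)
# Optimized: cheaper model, 80% of input cached, trimmed context, capped output.
lean = monthly_cost(10_000, 5, 2_500, 150, in_price=0.80, out_price=4.00,
                    cached_frac=0.8)
print(f"naive: ${naive:,.0f}/mo vs optimized: ${lean:,.0f}/mo")
```

Even with these rough inputs, the two configurations land more than an order of magnitude apart at the same conversation volume, which is the shape of the $4,000-vs-$400 difference described above.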
When to stop optimizing
One honest caveat: optimization has diminishing returns. Going from "no optimization" to "basic caching and routing" typically cuts costs by 70–80%. Going from "basic" to "heavily optimized" cuts another 20–30%. After that, you are spending engineer time to save dollars, and the math stops working. The right mental model is: optimize until the bill is a small fraction of the value the agent delivers, then stop. For most mid-market deployments, that line sits somewhere around $300–800 per month per agent.
If your AI agent bill has started growing faster than your usage, or you cannot tell where the money is going, that is exactly the kind of cleanup we do at N40 — usually a one-week audit followed by the fixes. Start a conversation at /contact.
