Every few months a client asks us some version of the same question: "should we self-host our LLM instead of paying OpenAI?" The answer is almost always conditional. For some workloads, self-hosting Llama 3.3 or a similar open-weight model on a Hetzner GPU instance is genuinely cheaper, faster, and more defensible from a compliance standpoint. For other workloads, the API is the right choice and self-hosting is a distraction. The art is in telling which kind yours is.
This article lays out the actual decision framework we use. It has four inputs: volume, sovereignty, latency, and quality. Each one can tip the decision, and usually two or three of them need to align before self-hosting is worth it.
The break-even math
Start with cost, because it is the easiest number to calculate. A Hetzner GEX44 instance (RTX 6000 Ada, 48GB VRAM) runs roughly €200/month. A larger GEX130 with dual GPUs costs around €600/month. On either, you can serve Llama 3.3 70B (4-bit quantized on the single-GPU box) or a similar-class model with vLLM at 2,000–4,000 tokens per second of aggregate throughput.
Compare that to API pricing. GPT-4o-mini is $0.15 per million input tokens and $0.60 per million output. Claude Haiku sits at similar levels. At those rates, a GEX44 instance breaks even with the API somewhere between 8 and 15 million tokens per day, depending on your input/output ratio. Below that volume, the API is cheaper. Above it, self-hosting wins on pure cost.
For larger models, the break-even point shifts down. GPT-4o at $2.50/$10.00 per million tokens is expensive enough that a GEX130 running Llama 3.3 70B pays for itself around 3–5 million tokens per day. That is a volume a single serious production agent can hit.
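The arithmetic is easy to sanity-check against your own numbers. A minimal sketch, using the prices quoted above and assuming a 3:1 input:output token ratio (the ratio, the 30-day month, and treating € and $ as roughly interchangeable are all simplifications — plug in your actual rates):

```python
def api_cost_per_day(tokens_per_day: float, input_share: float,
                     price_in: float, price_out: float) -> float:
    """Daily API cost in $ for a given volume and input/output mix.

    price_in / price_out are $ per million tokens.
    """
    millions = tokens_per_day / 1e6
    return millions * (input_share * price_in + (1 - input_share) * price_out)

def break_even_tokens_per_day(monthly_server_cost: float, input_share: float,
                              price_in: float, price_out: float) -> float:
    """Daily token volume at which a fixed-cost server matches the API."""
    daily_server = monthly_server_cost / 30
    blended_per_token = (input_share * price_in
                         + (1 - input_share) * price_out) / 1e6
    return daily_server / blended_per_token

# GEX130 (~€600/mo) vs GPT-4o at $2.50/$10.00 per million tokens,
# assuming 75% of tokens are input:
be = break_even_tokens_per_day(600, 0.75, 2.50, 10.00)
print(f"break-even: {be / 1e6:.1f}M tokens/day")  # → break-even: 4.6M tokens/day
```

The result lands inside the 3–5 million tokens/day range quoted above; shifting the input/output ratio or the server price moves it, which is why the article gives ranges rather than a single number.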
The important caveat: this math ignores engineering time. Standing up a self-hosted LLM properly — not just running Ollama once, but monitoring, auto-scaling, model updates, failover — is 1–2 weeks of initial work and ongoing operational overhead. Budget that honestly.
The sovereignty case
Cost is often not the real reason for self-hosting. Data sovereignty is. For clients in the EU, especially in regulated industries (finance, health, legal, public sector), sending user data to a US-based API triggers a chain of legal and compliance considerations that can be genuinely hard to resolve — Schrems II, Standard Contractual Clauses, US CLOUD Act exposure, and so on. In practice, the cleanest path is often "the data never leaves our EU datacenter."
Hetzner datacenters in Falkenstein, Nuremberg, and Helsinki make that straightforward. An LLM running on a German-hosted GPU instance, queried from a German application, backed by a German DPA with Hetzner — the data residency story is clean. No US subprocessor, no transatlantic data transfer, no awkward conversations with the client's legal team.
Several of our clients chose self-hosting for exactly this reason. Their volume alone would not have justified it on cost grounds, but once you factor in the avoided compliance cost — not having to document transatlantic transfers, not having to explain OpenAI's subprocessor list to auditors — the business case flips.
The latency case
API-based LLMs have an irreducible latency floor. Even on a fast connection, a request to OpenAI from an EU datacenter typically takes 200–400ms before the first token arrives, then streams at 30–60 tokens per second. For most applications, this is fine. For some — real-time voice agents, interactive coding tools, games — it is not.
A locally hosted model can deliver first-token latency under 50ms and stream at 80–150 tokens per second on the right hardware. The difference is not subtle. A voice agent with 50ms latency feels conversational. A voice agent with 400ms latency feels like it is thinking about what to say, because it literally is.
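Before committing either way, measure rather than guess. The sketch below is a hypothetical helper, not part of any client library: it consumes any stream of text chunks — for example the deltas yielded by an OpenAI-compatible streaming client pointed at a local vLLM endpoint — and reports time-to-first-token and decode rate:

```python
import time
from typing import Iterable, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Consume a token stream; return (time_to_first_token_s, tokens_per_s).

    The decode rate is measured over the chunks after the first, so it
    reflects streaming speed rather than queueing + prefill time.
    """
    start = time.perf_counter()
    ttft = None
    n = 0
    for _chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start
        n += 1
    total = time.perf_counter() - start
    if ttft is None:          # empty stream
        return total, 0.0
    decode = total - ttft
    rate = (n - 1) / decode if n > 1 and decode > 0 else float("nan")
    return ttft, rate
```

Run it against both your self-hosted endpoint and the hosted API from the same datacenter; the gap between the two TTFT numbers is what your users will feel.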
If your application genuinely needs sub-100ms inference, self-hosting is often the only way to get there. This is the case where we see clients self-hosting even at modest volumes.
Quality: is Llama 3.3 good enough?
The honest answer: for most business tasks, yes. Llama 3.3 70B benchmarks close to GPT-4o-mini on MMLU, HumanEval, and most practical tasks. It is notably worse than GPT-4o or Claude Sonnet on complex reasoning and long-context tasks. For classification, extraction, routine Q&A, summarization, and tool-use at moderate complexity, you will not feel the difference. For a research agent that needs to reason through a 100-page document or a coding agent tackling a large refactor, you will.
The working rule we give clients: use self-hosted Llama for the 80% of requests that do not need frontier intelligence, and route the hard 20% to a hosted frontier model. This hybrid pattern captures most of the cost savings without compromising on the tasks that matter.
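The hybrid pattern can be as simple as a routing function in front of two OpenAI-compatible backends. A sketch — the base URLs, model names, task labels, and the 32k-context threshold are all illustrative assumptions, not a prescribed implementation:

```python
# Route cheap/easy requests to the self-hosted model; escalate the
# hard minority to a hosted frontier model. Endpoints are placeholders.
LOCAL = {"base_url": "http://llm.internal:8000/v1",
         "model": "llama-3.3-70b"}
FRONTIER = {"base_url": "https://api.openai.com/v1",
            "model": "gpt-4o"}

# Task types the local model is known to struggle with (assumed labels).
HARD_TASKS = {"complex_reasoning", "long_context", "large_refactor"}

def pick_backend(task_type: str, context_tokens: int) -> dict:
    """Default to the self-hosted model; use the frontier model only
    when the task type or context size genuinely needs it."""
    if task_type in HARD_TASKS or context_tokens > 32_000:
        return FRONTIER
    return LOCAL
```

Because vLLM speaks the OpenAI wire format, both backends can be called with the same client code — only `base_url` and `model` change per request.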
The infrastructure stack
For a production self-hosted deployment, the stack we use on client projects is:
- Hardware: Hetzner GEX44 (€200/mo) for 7B–13B models or single-GPU 70B at 4-bit quantization. GEX130 (€600/mo) for 70B at higher precision or concurrent model serving.
- Serving: vLLM for production throughput, Ollama for simpler deployments and faster iteration. vLLM gives 2–5x more throughput per GPU for the same model.
- Reverse proxy: Caddy with automatic TLS. Simpler than nginx for this use case, and the automatic cert renewal is one less thing to monitor.
- Observability: Prometheus and Grafana for GPU utilization, request latency, and queue depth. You want to know when you are at capacity before your users tell you.
- Gateway: A thin OpenAI-compatible API shim (vLLM provides this out of the box) so your application code works unchanged if you later decide to switch to or from a hosted API.
This stack is enough to run a production agent with a few thousand daily users from day one. It will need operational attention — model updates, occasional tuning, monitoring — but it is not exotic.
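The gateway point is worth making concrete: because vLLM implements the standard `/v1/chat/completions` route, application code targets one wire format regardless of backend. A minimal sketch (hostnames and model name are placeholders):

```python
import json

def chat_payload(base_url: str, model: str, messages: list) -> tuple:
    """Build a request for any OpenAI-compatible /chat/completions
    endpoint -- self-hosted vLLM or a hosted API, same wire format."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

# Same application code, different backend = one config change:
url, body = chat_payload("http://llm.internal:8000/v1",
                         "meta-llama/Llama-3.3-70B-Instruct",
                         [{"role": "user", "content": "ping"}])
```

Keeping that shim boundary in place from day one is what makes the "switch to or from a hosted API later" option cheap to exercise.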
When not to self-host
Self-hosting is wrong in three scenarios:
- Low volume. Under a million tokens a day, the API is almost always cheaper end-to-end once you count engineering and operational time. Self-hosting at that scale is hobbyist work dressed up as strategy.
- Bleeding-edge needs. If your application depends on the latest reasoning models or the newest multimodal capabilities, open-weight models are usually 6–12 months behind. Self-hosting locks you out of that frontier.
- No DevOps capacity. If you do not already have someone who can keep a Linux server alive, self-hosting an LLM is a bad first project. The API is a pay-to-skip-this option, and it is worth paying for.
For everyone else — teams running production AI at scale, with compliance needs, latency requirements, or predictable token volumes — self-hosting on Hetzner is a perfectly reasonable architecture choice in 2026. It is not trendy; it is just cheaper and more defensible.
If you are trying to work out whether the break-even math applies to your situation, or you want the self-hosted stack set up properly, that is work we do regularly at N40. Start a conversation at /contact.
