Is it cheaper to run an LLM on-prem or in the cloud?

For steady, high-utilization workloads, owned hardware in colocation is almost always cheaper over a 2–3 year window. The break-even point typically lands at 12–18 months once you're running GPUs above roughly 40% utilization. Cloud stays cheaper for spiky, low-utilization, or short-term workloads where you can't keep the hardware busy.

What is the true total cost of ownership for an on-prem LLM server?

TCO includes hardware amortization (an 8x H100 server at ~$300K over 3 years ≈ $8,300/month), colocation space and power (roughly $5,500–11,500/month for a dense rack), connectivity, and managed support. All-in, a dense GPU rack runs roughly $16,000–25,000/month, versus $150,000+/month for equivalent cloud GPU capacity at steady utilization.

At what cloud spend does owned hardware start making sense?

The rule of thumb is $15,000–20,000/month in consistent cloud GPU spend. Below that, the capital commitment usually isn't worth it. Above it — especially past $50,000/month — you're typically leaving significant money on the table by continuing to rent.

What hidden cloud costs do people forget?

Data egress fees, cold-start latency from reloading model weights, GPU availability gaps during peak demand, and the 1–3 year commitments needed to get below on-demand pricing. These rarely show up in the headline per-hour rate but materially change the comparison.

On-Prem LLM Deployment vs. Cloud: The Real Cost Breakdown

“Should we keep renting cloud GPUs or buy our own hardware?” is one of the most common questions we get — and it’s almost always asked with incomplete numbers on at least one side. Cloud bills hide costs in egress and cold starts. On-prem estimates forget power, cooling, and who fixes a dead GPU at 3am.

This guide lays out the full cost picture for running an LLM both ways, so you can do the comparison honestly for your own workload.

The headline numbers everyone starts with

Here’s the comparison most teams run first, and it’s not wrong — it’s just incomplete.

A single H100 GPU on a major cloud runs roughly $30–35/hour on-demand. Keep it running around the clock (about 730 hours a month) and you’re at roughly $21,000–25,000/month per GPU, since on-demand billing follows wall-clock hours rather than how busy the GPU is. An eight-GPU node, the typical unit for serious LLM serving, lands near $170,000–200,000/month.

For comparison, you can buy an 8× H100 SXM5 server outright for around $280,000–350,000. At $170K/month in cloud spend, the hardware pays for itself in about two months on paper.
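The napkin math above can be sketched in a few lines. This is a rough illustration rather than a pricing model: it assumes about 730 billed hours in a month and instances that are never shut down, and the function names are made up for this example.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8,760 / 12)


def cloud_monthly_cost(hourly_rate_per_gpu: float, gpus: int) -> float:
    """Monthly bill for GPUs that stay up around the clock."""
    return hourly_rate_per_gpu * HOURS_PER_MONTH * gpus


def naive_payback_months(hardware_price: float, cloud_monthly: float) -> float:
    """Months of cloud spend needed to equal the purchase price.

    Ignores every operating cost of owned hardware, which is why the
    result only holds "on paper".
    """
    return hardware_price / cloud_monthly


# Figures from the article: $30-35/hour per GPU, 8 GPUs,
# a ~$300K server, and ~$170K/month of cloud spend.
low = cloud_monthly_cost(30, 8)
high = cloud_monthly_cost(35, 8)
payback = naive_payback_months(300_000, 170_000)

print(f"8-GPU cloud node: ${low:,.0f} to ${high:,.0f} per month")
print(f"Naive payback: {payback:.1f} months")
```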

That math is real, but “on paper” is doing a lot of work. Owning hardware comes with costs the cloud bundles into that per-hour rate. Let’s add them.

What owned hardware actually costs

When you own the server, you take on the operating costs the cloud was quietly handling. Here’s the full monthly picture for a dense GPU rack in colocation:

Cost line	Monthly figure
Hardware amortization (8× H100, ~$300K over 36 months)	~$8,300
Colocation cabinet space	$500–1,500
Power (30–40 kW at $150–250 per kW per month)	$4,500–10,000
Cross-connect / connectivity	$300–600
Managed support / operations	$2,000–5,000
All-in total	~$16,000–25,000/month

So the honest comparison isn’t “$300K once” versus “$170K/month.” It’s roughly $16,000–25,000/month all-in for owned, versus $150,000+/month for equivalent cloud capacity at steady utilization. The owned path still wins decisively, but by a 7–10x margin, not the roughly 20x you get by setting a $170K cloud bill against ~$8,300/month of hardware amortization alone.
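For readers who want to rerun the table with their own quotes, here is a minimal sketch of the monthly total. It assumes the power rate is quoted in dollars per kW per month and uses straight-line amortization with no interest; `monthly_tco` and its parameter names are illustrative, not part of any tool. Because it multiplies the power range out exactly, the low end lands a few hundred dollars under the table's rounded total.

```python
def monthly_tco(
    hardware_price: float,
    amortization_months: int,
    space: float,
    power_kw: float,
    rate_per_kw_month: float,
    connectivity: float,
    support: float,
) -> float:
    """All-in monthly cost of an owned server in colocation."""
    amortization = hardware_price / amortization_months  # straight-line, no interest
    power = power_kw * rate_per_kw_month
    return amortization + space + power + connectivity + support


# Low and high ends of each line item in the table above.
low_total = monthly_tco(300_000, 36, 500, 30, 150, 300, 2_000)
high_total = monthly_tco(300_000, 36, 1_500, 40, 250, 600, 5_000)

print(f"All-in monthly: ${low_total:,.0f} to ${high_total:,.0f}")
```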

We break down each of these colocation line items further in On-Prem vs. Colocation vs. Cloud for AI Workloads.

A note on “on-prem” vs. “owned in colocation”

People say “on-prem” to mean “our hardware,” but literal on-premises deployment — the server in your office or building — is usually the wrong version of owned.

Office buildings aren’t wired for 10–12kW per server or engineered to dissipate that heat. Solving power, cooling, physical security, and 2am failure response yourself is a real cost that rarely shows up in spreadsheets, and it’s almost always higher than renting a cabinet in a facility built for it.

For nearly every team below large-enterprise scale, “owned hardware in colocation” captures the cost advantage of ownership without the operational tax of running a facility. When this guide compares “on-prem” to cloud, that’s the version of on-prem we mean. The pure office-closet version only makes sense under strict data-sovereignty constraints.

The cloud costs that don’t show up on the quote

The per-hour rate is the visible cost. These are the ones that change the comparison:

Egress fees. Pulling model outputs and data back to your application layer is metered. At volume, it’s a line item, not a rounding error.
Cold starts. Every time a new instance spins up, your model weights — potentially 140GB+ for a 70B model — reload before you serve a single token. You pay for that idle load time, and your users feel the latency.
Availability gaps. During peak demand, your preferred instance type may simply not be available. The cost of “we couldn’t scale when we needed to” doesn’t appear on any invoice.
Reserved-instance lock-in. The cheaper cloud rates require 1–3 year commitments. Once you’re committing for that long anyway, the flexibility argument for cloud — its main advantage — largely evaporates.
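To put a rough number on the cold-start item, the sketch below estimates weight size, load time, and the GPU time billed while loading. The 2 bytes per parameter corresponds to FP16/BF16 weights, which is where the 140GB figure for a 70B model comes from. The load throughput of 1 GB/s is a hypothetical placeholder, not a measurement, and the $32/hour rate and 4 GPUs are example inputs; substitute what your storage and invoice actually show.

```python
def weights_size_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (2 bytes per parameter = FP16/BF16)."""
    return params_billions * bytes_per_param


def cold_start_seconds(size_gb: float, load_gb_per_s: float) -> float:
    """Time to read the weights at a given sustained throughput."""
    return size_gb / load_gb_per_s


def cold_start_cost(seconds: float, hourly_rate_per_gpu: float, gpus: int) -> float:
    """GPU time billed while the instance loads weights and serves nothing."""
    return seconds / 3600 * hourly_rate_per_gpu * gpus


size = weights_size_gb(70)
seconds = cold_start_seconds(size, load_gb_per_s=1.0)  # hypothetical throughput
cost = cold_start_cost(seconds, hourly_rate_per_gpu=32, gpus=4)  # example inputs

print(f"{size:.0f} GB of weights, {seconds:.0f} s to load, ${cost:.2f} per cold start")
```

Multiply the per-start cost by how often instances actually spin up in a month to see whether it matters for your workload.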

The break-even, honestly

Utilization is the whole game. Owned hardware is a fixed cost; cloud is variable. The more you keep the GPUs busy, the better owning looks.

Below ~40% utilization: Cloud often wins. You’re paying to own idle silicon.
40–70% utilization, sustained 12+ months: Owned hardware typically breaks even at 12–18 months and saves meaningfully after.
70%+ utilization, 24/7: Owning is decisively cheaper. The capital cost is recovered fast and everything after is margin.
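The tiers above are rules of thumb. Where the crossover falls for a specific workload depends heavily on the hourly rate you actually pay, which the sketch below makes explicit. It assumes cloud cost scales linearly with utilization (you only pay for busy hours) and that owned cost is flat. The $20,000/month owned figure sits inside the all-in range quoted earlier, and the hourly rates in the loop are hypothetical inputs chosen only to show the sensitivity.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def cloud_monthly_cost(hourly_rate_per_gpu: float, gpus: int, utilization: float) -> float:
    """Cloud bill if you pay only for the fraction of hours the GPUs are busy."""
    return hourly_rate_per_gpu * HOURS_PER_MONTH * gpus * utilization


def break_even_utilization(owned_monthly: float, hourly_rate_per_gpu: float, gpus: int) -> float:
    """Utilization at which the cloud bill equals the fixed owned cost."""
    return owned_monthly / (hourly_rate_per_gpu * HOURS_PER_MONTH * gpus)


owned_monthly = 20_000  # example value within the all-in range above
for rate in (4, 8, 16, 32):  # hypothetical effective $/GPU-hour
    u = break_even_utilization(owned_monthly, rate, gpus=8)
    print(f"${rate}/GPU-hour: owning wins above {u:.0%} utilization")
```

Lower effective rates, such as reserved or negotiated pricing, push the crossover utilization up; higher rates pull it down.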

The practical threshold we tell teams: once you’re spending $15,000–20,000/month consistently on cloud GPUs, it’s worth modeling owned hardware seriously. Past $50,000/month, continuing to rent is usually leaving real money on the table.

This is the same “pay once at sustained utilization” logic that applies all the way down to a single desktop — we worked through that smaller-scale version in AMD’s Agent Computer Pitch. The unit changes; the math doesn’t.

Capex, opex, and the financing wrinkle

There’s a real objection to owning that has nothing to do with the total cost: cloud is operating expense, hardware is capital expense, and a $300K capital outlay hits differently than a monthly bill — especially for a startup watching runway.

This is a genuine consideration, but it’s smaller than it looks for two reasons.

First, financing exists. Most hardware can be spread over 12–36 month terms through partner lenders, which converts the capital outlay back into a monthly payment. The amortization figure above already treats the hardware as a monthly cost (interest on a financed purchase would add to it), so you’re not writing a single $300K check unless you choose to.

Second, the opex you’re comparing against is enormous. Trading a $170K/month operating expense for a $16–25K/month one is the kind of opex reduction that extends runway, not shortens it. The capex-versus-opex framing matters most when the totals are close. Here they aren’t.
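On the financing point, a straight-line split of $300K over 36 months gives about $8,300/month, and a financed purchase adds interest on top. The sketch below uses the standard fixed-payment loan formula to show the size of that difference. The 10% annual rate is a hypothetical placeholder, not a quoted lender term.

```python
def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Fixed monthly payment on a fully amortizing loan."""
    if annual_rate == 0:
        return principal / months  # straight-line split, no interest
    r = annual_rate / 12  # monthly interest rate
    return principal * r / (1 - (1 + r) ** -months)


straight_line = monthly_payment(300_000, 0.0, 36)
financed = monthly_payment(300_000, 0.10, 36)  # hypothetical 10% APR

print(f"Straight-line: ${straight_line:,.0f}/month")
print(f"Financed at 10% APR: ${financed:,.0f}/month")
```

Even with interest included, the payment stays in the same order of magnitude as the amortization figure, which is small next to the cloud bill it replaces.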

The one place the framing genuinely bites: if you’re pre-revenue and pre-product-market-fit, tying up capital — even financed — in fixed infrastructure is the wrong call regardless of the math. Flexibility is worth more than efficiency when you don’t yet know what you’re building. That’s a strategic judgment, not a spreadsheet one.

A worked example

Concrete numbers make the comparison real. Say you’re serving a 70B model in production, GPUs busy ~60% of the day, currently on a 4-GPU cloud setup.

Cloud, today: Four H100-class GPUs at ~$32/hour each, billed for 60% of a ~730-hour month, run roughly $56,000/month before egress. Call it $60,000/month all-in. Over 24 months: $1.44M.

Owned, in colocation: A Rivram Trail Boss-class 4-GPU node, amortized hardware plus full colocation overhead and managed support, runs roughly $14,000–18,000/month. Over 24 months: ~$340,000–430,000.

Same workload, same two years: north of $1M in difference. Even if our colocation estimate is off by 30% and the cloud number is generous, the gap doesn’t close — it’s structural, driven by paying variable-cost rates for fixed-cost usage. That’s the whole argument in one example.
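The arithmetic behind this example, including the "off by 30%" check, can be reproduced as follows. All inputs are the figures quoted above. The only added assumption is a 730-hour month, and the pessimistic case simply inflates the high end of the owned estimate by 30%.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month
MONTHS = 24

# Cloud: 4 GPUs at $32/hour, busy 60% of the time, before egress.
cloud_compute = 4 * 32 * HOURS_PER_MONTH * 0.60
cloud_all_in = 60_000  # the rounded all-in figure used in the example
cloud_total = cloud_all_in * MONTHS

# Owned in colocation: $14,000-18,000/month all-in.
owned_total_low = 14_000 * MONTHS
owned_total_high = 18_000 * MONTHS
gap_min = cloud_total - owned_total_high

# Sensitivity: the high-end owned estimate turns out 30% too low.
pessimistic_owned_total = 18_000 * 1.30 * MONTHS
pessimistic_gap = cloud_total - pessimistic_owned_total

print(f"Cloud compute before egress: ${cloud_compute:,.0f}/month")
print(f"24-month cloud total: ${cloud_total:,.0f}")
print(f"24-month owned total: ${owned_total_low:,.0f} to ${owned_total_high:,.0f}")
print(f"Smallest gap: ${gap_min:,.0f}")
print(f"Gap with owned costs 30% higher: ${pessimistic_gap:,.0f}")
```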

Matching hardware to the spend

If the numbers point toward owning, the next question is what to buy — and the answer should track your workload, not the biggest GPU you can afford. A few reference points from our rack bundles:

Rivram Seed (1× L40S) — a first production node for models up to ~13B. The cheapest way to get off the cloud meter for a single workload.
Rivram Trail Boss (4× L40S) — 70B-class inference and multi-tenant API serving, where most teams crossing the $20K/month cloud line land.
Rivram Titan (8× H100/H200/B200, custom build) — frontier-model and high-throughput batch workloads.

Overbuying is its own cost. We see teams buy 8× H100 nodes to “future-proof” when more L40S nodes would have served their actual workload at lower cost-per-request.

Doing this for your own numbers

The honest comparison depends on your specific model, utilization, and electricity rate — there’s no universal answer. The framework is straightforward, but the inputs are yours:

Pull your actual average GPU utilization, not your peak.
Use your real cloud bill including egress, not the headline instance rate.
Amortize realistic hardware cost over 36 months.
Add full colocation overhead — power, space, connectivity, support.
Compare the monthly totals over a 24–36 month window.
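The five steps above map onto a small calculation. The sketch below is one way to structure it, not a finished model: the `Workload` fields and the `compare` function are names invented for this example, the 0.40 utilization flag reuses the threshold from the break-even section, and the numbers in the usage lines are placeholders to be replaced with your own.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    avg_utilization: float        # step 1: average, not peak (0.0 to 1.0)
    cloud_monthly_bill: float     # step 2: real bill, including egress
    hardware_price: float         # step 3: realistic purchase price
    colo_monthly_overhead: float  # step 4: power, space, connectivity, support
    amortization_months: int = 36


def compare(w: Workload, window_months: int) -> dict:
    """Step 5: compare monthly and cumulative totals over the window."""
    owned_monthly = w.hardware_price / w.amortization_months + w.colo_monthly_overhead
    cloud_total = w.cloud_monthly_bill * window_months
    owned_total = owned_monthly * window_months
    return {
        "owned_monthly": owned_monthly,
        "cloud_total": cloud_total,
        "owned_total": owned_total,
        "savings": cloud_total - owned_total,
        # Flag workloads under the ~40% utilization rule of thumb.
        "below_utilization_threshold": w.avg_utilization < 0.40,
    }


# Placeholder inputs for illustration only.
example = Workload(
    avg_utilization=0.60,
    cloud_monthly_bill=60_000,
    hardware_price=300_000,
    colo_monthly_overhead=10_000,
)
result = compare(example, window_months=24)
for key, value in result.items():
    print(key, value)
```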

If you’re past the $15K/month mark and want this modeled properly for your workload, that’s exactly what our planning service does — TCO modeling is one of its core deliverables. Get in touch and we’ll run the numbers with you, including the ones cloud invoices tend to bury.