“What GPU do I need to run a 70B model?” sounds like it should have a one-line answer. It doesn’t — because the real answer depends on three things most people leave out of the question: what precision you’re running, how long your context is, and how many requests you need to serve at once.
Get those three straight and the hardware choice falls out naturally. Here’s how to think about it.
Start with the weights
A model’s parameters have to fit in GPU memory, and the memory they take depends on the precision you store them at. The rule of thumb: multiply the parameter count by the bytes per parameter.
For a 70B model:
| Precision | Bytes/param | Weights alone | Fits on |
|---|---|---|---|
| FP16 / BF16 (full) | 2 | ~140 GB | 2× 80GB GPU |
| INT8 / FP8 (8-bit) | 1 | ~70 GB | 1× 80GB GPU (tight) |
| INT4 / FP4 (4-bit) | 0.5 | ~35–40 GB | 1× 48GB GPU |
That table is the foundation of every 70B hardware decision. At full precision you need at least two 80GB GPUs just to hold the weights. Quantize to 4-bit and the same model squeezes onto a single 48GB card.
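The rule of thumb behind the table can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name and the precision labels are our own, and it counts decimal gigabytes (1 GB = 1e9 bytes). Real 4-bit checkpoints usually come out somewhat above the raw figure because quantization scales and other metadata are stored alongside the weights, which is one reason the table gives a range for that row.

```python
# Bytes per parameter for each storage precision (values from the table above).
BYTES_PER_PARAM = {
    "fp16": 2.0, "bf16": 2.0,
    "int8": 1.0, "fp8": 1.0,
    "int4": 0.5, "fp4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the memory needed for the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gb(70e9, precision):.0f} GB")
# fp16: 140 GB
# int8: 70 GB
# int4: 35 GB
```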
But weights are only half the story.
Then add the KV cache — the part people forget
When an LLM generates text, it keeps a running memory of the conversation so far called the KV cache. It grows with two things: how long your context is, and how many requests you’re serving concurrently. And it’s stored in GPU memory right alongside the weights.
For a 70B model, the KV cache is not a rounding error:
- Short context (4–8K tokens), a few concurrent requests: a few GB.
- Long context (128K tokens): the cache for a single request can run into the tens of GB. Serve several long-context requests at once and you can need 100GB+ of KV cache on top of the weights (140GB at FP16).
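To see where those numbers come from, here is a rough estimate of KV cache size. It assumes a Llama-style 70B architecture (80 layers, 8 key/value heads with grouped-query attention, head dimension 128) and a 16-bit cache; those defaults are assumptions for illustration, so check your own model's config before relying on them. The function name is hypothetical.

```python
def kv_cache_gb(
    context_tokens: int,
    concurrent_requests: int,
    num_layers: int = 80,       # assumed: Llama-style 70B
    num_kv_heads: int = 8,      # assumed: grouped-query attention
    head_dim: int = 128,        # assumed
    bytes_per_value: int = 2,   # assumed: FP16/BF16 cache
) -> float:
    """Estimate KV cache size in decimal GB when every request fills its context."""
    # The leading 2 accounts for one key and one value tensor per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * concurrent_requests * bytes_per_token / 1e9


print(f"{kv_cache_gb(8_192, 4):.1f} GB")     # 10.7 GB
print(f"{kv_cache_gb(131_072, 1):.1f} GB")   # 42.9 GB
print(f"{kv_cache_gb(131_072, 3):.1f} GB")   # 128.8 GB
```

Under these assumptions each token costs about 0.33 MB of cache, so a single 128K-token request needs roughly 43 GB and three of them pass 100 GB. Models without grouped-query attention, or with more KV heads, will need considerably more.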
This is the single most common 70B sizing mistake we see. A team checks that 4-bit weights fit on one GPU, deploys, and then falls over the moment real users send long prompts in parallel. The weights fit; the cache didn’t.
The practical takeaway: size for weights plus the KV cache your actual context length and concurrency demand. That’s why production 70B serving often lands on 4 GPUs even when the weights alone would fit on one or two.
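As a rough sketch of that sizing step, the snippet below turns a weights figure and a KV cache figure into a GPU count. The 90% usable-memory fraction is an assumption (some memory goes to activations, the CUDA context, and fragmentation), and the power-of-two rounding reflects a common tensor-parallel practice rather than a hard rule. The function and parameter names are hypothetical.

```python
import math


def gpus_needed(
    weights_gb: float,
    kv_cache_gb: float,
    gpu_memory_gb: float,
    usable_fraction: float = 0.9,   # assumed headroom for activations and overhead
    power_of_two: bool = True,      # assumed: typical tensor-parallel layout
) -> int:
    """Estimate how many GPUs are needed to hold weights plus KV cache."""
    usable_per_gpu = gpu_memory_gb * usable_fraction
    count = max(1, math.ceil((weights_gb + kv_cache_gb) / usable_per_gpu))
    if power_of_two:
        # Round up to the next power of two (1, 2, 4, 8, ...).
        count = 1 << (count - 1).bit_length()
    return count


# 8-bit weights (70 GB) plus 40 GB of cache on 48GB cards.
print(gpus_needed(70, 40, 48))   # 4
# 4-bit weights (35 GB) plus 5 GB of cache on one 48GB card.
print(gpus_needed(35, 5, 48))    # 1
```

The first case shows how a model whose 8-bit weights would fit on two 48GB cards still ends up on four once a realistic cache budget is added.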
Quantization: the lever that changes everything
Quantization — storing weights at lower precision — is the biggest single lever on 70B hardware cost. The tradeoffs:
- FP16/BF16 (full): Highest fidelity, highest memory. Use when quality is paramount and you have the GPUs.
- 8-bit (INT8/FP8): Roughly halves memory with minimal quality loss for most workloads. A strong default. Fits a 70B model on a single 80GB H100 with room for modest context.
- 4-bit (INT4/FP4): Roughly quarters memory. Quality impact is small for many use cases and noticeable for some. Lets a 70B model run on a single 48GB GPU.
For most production deployments, 8-bit is the sweet spot — large memory savings, negligible quality cost. Go to 4-bit when you’re memory-constrained or cost-sensitive and you’ve validated the quality is acceptable for your specific use case. Don’t assume; test it on your prompts.
So, concretely: which GPU?
Here’s how the precision and concurrency picture maps onto actual hardware. These configurations come straight from how we build rack bundles for clients running 70B-class models.
Single 48GB GPU: L40S or RTX 6000 Ada
Runs a 4-bit 70B model for lower-volume, single-tenant inference. The entry point. This is the GPU class in our Rivram Seed and Ranger bundles. Good for internal tools and moderate throughput; not for high-concurrency public serving.
Single 80GB GPU: H100 PCIe/SXM5
Runs an 8-bit 70B model with room for reasonable context. A clean single-GPU production option when you don’t need to fan out across many concurrent long-context requests.
4× L40S 48GB: the production default
192GB of aggregate VRAM gives you near-full-precision quality via 8-bit weights, plus generous KV-cache headroom for real concurrency and context. This is the Rivram Trail Boss configuration, and it’s where most teams serving 70B-class models in production land. Best balance of cost, memory, and throughput.
8× H100 SXM5: maximum throughput
When you need the highest concurrency, the longest contexts, or you’re serving multiple 70B models at once, the Rivram Titan custom build with NVLink-connected SXM5 GPUs is the answer. Overkill for a single moderate workload; exactly right at scale.
You can compare the raw specs across all of these on the hardware catalog.
Throughput, batching, and why one GPU isn’t the answer for everyone
Fitting the model in memory gets you to “it runs.” It doesn’t get you to “it serves your users.” Throughput is a separate question, and it’s driven by batching — how many requests the GPU processes at once.
Modern inference servers (vLLM, TGI, Triton) batch incoming requests together to keep the GPU’s compute units saturated. The catch: every request in a batch needs its own slice of KV cache. So in practice your effective throughput is often capped less by raw compute than by how much memory is left for the cache after the weights are loaded.
This is the quiet reason single-GPU 70B setups disappoint under load. The weights fit, but with little memory left for cache, the server can only batch a handful of requests — and throughput per dollar collapses the moment traffic arrives. Spreading the model across 4 GPUs frees up far more aggregate memory for batching, which is why the production default is a multi-GPU node even when a single card “technically works.”
If your traffic is bursty or concurrency is high, size for the batch you need to serve, not the smallest box the weights fit in.
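A back-of-the-envelope way to check this is to ask how many requests fit in the memory left over after the weights load. The sketch below is a worst-case estimate that assumes every request fills its full context; the per-token cache cost (327,680 bytes, a Llama-style 70B with a 16-bit cache) and the 90% usable-memory fraction are assumptions, and the function name is hypothetical. Servers that allocate cache on demand can fit more requests when prompts are shorter than the maximum.

```python
def max_concurrent_requests(
    total_gpu_memory_gb: float,
    weights_gb: float,
    context_tokens: int,
    kv_bytes_per_token: int = 327_680,  # assumed: Llama-style 70B, 16-bit cache
    usable_fraction: float = 0.9,       # assumed headroom for overhead
) -> int:
    """Worst-case number of full-context requests that fit beside the weights."""
    free_bytes = (total_gpu_memory_gb * usable_fraction - weights_gb) * 1e9
    if free_bytes <= 0:
        return 0  # the weights alone do not fit in usable memory
    return int(free_bytes // (context_tokens * kv_bytes_per_token))


# One 48GB GPU, 4-bit weights (35 GB), 8K-token requests.
print(max_concurrent_requests(48, 35, 8_192))    # 3
# Four 48GB GPUs (192 GB), 8-bit weights (70 GB), 8K-token requests.
print(max_concurrent_requests(192, 70, 8_192))   # 38
```

Under these assumptions, the four-GPU node holds more than ten times as many concurrent 8K-token requests as the single card, despite having only four times the memory, because the weights are paid for once.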
Multi-GPU: when the interconnect starts to matter
Once you’re across multiple GPUs, how they talk to each other becomes a factor — though less than people assume for inference.
- Within a single server (the common case), GPUs communicate over PCIe or, on SXM platforms, NVLink. For most 70B inference, standard PCIe between 4 L40S cards is entirely sufficient. NVLink’s 900GB/s GPU-to-GPU bandwidth pays off for the largest models and training, less so for typical inference.
- Across multiple servers (multi-node), networking becomes critical and you’re into 100GbE/RoCE or InfiniBand territory. But you only get here when a single 70B model is too large for one box — which, with quantization, it usually isn’t.
The practical implication: for a 70B model, you almost never need to leave a single server, and you rarely need NVLink-class interconnect. A 4× L40S node on PCIe is the unglamorous, correct answer for most teams. We cover the networking decision in more depth in How to Plan Your First GPU Deployment.
The mistake worth repeating: scale-out often beats scale-up
Teams instinctively reach for the biggest GPU to “future-proof.” For 70B inference, that’s frequently the wrong call. Four L40S nodes can out-serve one H100 node on throughput-per-dollar for many workloads, because inference parallelizes well and you’re rarely bottlenecked on the things SXM5 and NVLink are built for.
Buy for your actual workload — your real context length, your real concurrency, your real quality bar — not the largest model you might theoretically run someday. We dig into this tradeoff further in How to Plan Your First GPU Deployment.
Quick reference
| Your situation | Configuration |
|---|---|
| Internal tool, low volume, cost-sensitive | 1× 48GB (L40S), 4-bit |
| Single-tenant production, moderate context | 1× 80GB (H100), 8-bit |
| Multi-tenant production, real concurrency | 4× L40S, 8-bit |
| High throughput, long context, multiple models | 8× H100 SXM5 |
Get the sizing right before you buy
The wrong 70B configuration costs you twice — once when you overpay for GPUs you didn’t need, and again when you under-provision the KV cache and your serving falls over under load. Both are avoidable with an honest read of your workload.
If you can describe your model, your context length, and your expected concurrency, we can size the hardware precisely. That’s the core of our planning service — reach out and we’ll spec the right configuration for what you’re actually running, not the spec sheet’s idea of it.