How much VRAM do you need to run a 70B model?

At FP16/BF16, a 70B model needs roughly 140GB just for weights — about 2× H100 80GB before any context. At 8-bit quantization it drops to ~70GB (fits on a single H100 80GB with little room for context). At 4-bit it's ~35–40GB, which fits on one 48GB GPU like an L40S. Add KV-cache headroom for real context lengths on top of all of these.

Can a single GPU run a 70B model?

Yes, with quantization. A 4-bit quantized 70B model fits on a single 48GB GPU (L40S, RTX 6000 Ada) or comfortably on an 80GB H100. Running unquantized FP16 weights requires at least two 80GB GPUs. Single-GPU 70B is fine for moderate throughput; high-concurrency production serving usually wants 4 GPUs.

How does context length change GPU requirements for a 70B model?

Context lives in the KV cache, which grows with both sequence length and concurrent requests. At long context (128K), a single request’s cache on a 70B model can run into the tens of GB, and several such requests at once can need 100GB+ of KV cache on top of the weights. This is why production 70B serving often uses 4 GPUs even when the weights alone would fit on fewer: the cache, not the weights, drives the memory budget.

What is the best GPU for 70B model inference in 2026?

For most production teams, 4× NVIDIA L40S 48GB offers the best balance of cost, VRAM, and throughput for 70B-class serving. H100 SXM5 in a 2× or 8× configuration is preferred when you need maximum throughput, long context, or high concurrency. A single 80GB H100 with 8-bit quantization works for lower-volume single-tenant use.

What GPU Do I Need to Run a 70B Model?

“What GPU do I need to run a 70B model?” sounds like it should have a one-line answer. It doesn’t — because the real answer depends on three things most people leave out of the question: what precision you’re running, how long your context is, and how many requests you need to serve at once.

Get those three straight and the hardware choice falls out naturally. Here’s how to think about it.

Start with the weights

A model’s parameters have to fit in GPU memory, and the memory they take depends on the precision you store them at. The rule of thumb: multiply the parameter count by the bytes per parameter.

For a 70B model:

Precision	Bytes/param	Weights alone	Fits on
FP16 / BF16 (full)	2	~140 GB	2× 80GB GPU
INT8 / FP8 (8-bit)	1	~70 GB	1× 80GB GPU (tight)
INT4 / FP4 (4-bit)	0.5	~35–40 GB	1× 48GB GPU

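That table is the foundation of every 70B hardware decision. At FP16/BF16 you need at least two 80GB GPUs just to hold the weights. Quantize to 4-bit and the same model squeezes onto a single 48GB card. (The 4-bit range runs above the raw 35 GB because quantized checkpoints also store scaling factors and often keep some layers at higher precision.)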
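The rule of thumb behind the table can be sketched in a few lines of Python. This is a rough estimate only: the function name and the precision labels are illustrative, and the result covers raw weights with no quantization metadata, activations, or KV cache.

```python
# Bytes per parameter for each precision, matching the table above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate raw weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(70e9, precision)
        print(f"70B at {precision}: ~{size:.0f} GB of weights")
```

For 70 billion parameters this gives about 140, 70, and 35 GB. Treat these as lower bounds, since real checkpoints and serving runtimes add overhead on top.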

But weights are only half the story.

Then add the KV cache — the part people forget

When an LLM generates text, it keeps a running memory of the conversation so far called the KV cache. It grows with two things: how long your context is, and how many requests you’re serving concurrently. And it’s stored in GPU memory right alongside the weights.

For a 70B model, the KV cache is not a rounding error:

Short context (4–8K tokens), a few concurrent requests: a few GB.
Long context (128K tokens): the cache for a single request can run into the tens of GB. Serve several long-context requests at once and you can need 100GB+ of KV cache on top of the weights (140GB at FP16).

This is the single most common 70B sizing mistake we see. A team checks that 4-bit weights fit on one GPU, deploys, and then falls over the moment real users send long prompts in parallel. The weights fit; the cache didn’t.

The practical takeaway: size for weights plus the KV cache your actual context length and concurrency demand. That’s why production 70B serving often lands on 4 GPUs even when the weights alone would fit on one or two.
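To put rough numbers on the cache, here is a minimal sketch of the usual per-token estimate: keys plus values, for every layer and every KV head. The architecture defaults below (80 layers, 8 KV heads from grouped-query attention, head dimension 128, a 16-bit cache) are assumptions that match Llama-style 70B models. Check your own model's config before relying on them, because a model without grouped-query attention needs a much larger cache.

```python
def kv_cache_gb(
    tokens: int,
    concurrent_requests: int = 1,
    num_layers: int = 80,      # assumption: Llama-style 70B
    num_kv_heads: int = 8,     # assumption: grouped-query attention
    head_dim: int = 128,       # assumption: Llama-style 70B
    bytes_per_value: int = 2,  # assumption: FP16/BF16 cache
) -> float:
    """Approximate KV-cache size in decimal GB (1 GB = 1e9 bytes)."""
    # Factor of 2: one key vector and one value vector per token, layer, KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens * concurrent_requests / 1e9


if __name__ == "__main__":
    print(f"One 8K request:      {kv_cache_gb(8_192):.1f} GB")
    print(f"One 128K request:    {kv_cache_gb(131_072):.1f} GB")
    print(f"Three 128K requests: {kv_cache_gb(131_072, 3):.1f} GB")
```

Under these assumptions a single 8K-token request costs about 2.7 GB, a single 128K-token request about 43 GB, and three 128K-token requests together pass 100 GB.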

Quantization: the lever that changes everything

Quantization — storing weights at lower precision — is the biggest single lever on 70B hardware cost. The tradeoffs:

FP16/BF16 (full): Highest fidelity, highest memory. Use when quality is paramount and you have the GPUs.
8-bit (INT8/FP8): Roughly halves memory with minimal quality loss for most workloads. A strong default. Fits a 70B model on a single 80GB H100, though the fit is tight and leaves only limited room for context.
4-bit (INT4/FP4): Roughly quarters memory. Quality impact is small for many use cases and noticeable for some. Lets a 70B model run on a single 48GB GPU.

For most production deployments, 8-bit is the sweet spot — large memory savings, negligible quality cost. Go to 4-bit when you’re memory-constrained or cost-sensitive and you’ve validated the quality is acceptable for your specific use case. Don’t assume; test it on your prompts.

So, concretely: which GPU?

Here’s how the precision and concurrency picture maps onto actual hardware. These configurations come straight from how we build rack bundles for clients running 70B-class models.

Single 48GB GPU: L40S or RTX 6000 Ada

Runs a 4-bit 70B model for lower-volume, single-tenant inference. The entry point. This is the GPU class in our Rivram Seed and Ranger bundles. Good for internal tools and moderate throughput; not for high-concurrency public serving.

Single 80GB GPU: H100 PCIe/SXM5

Runs an 8-bit 70B model with room for reasonable context. A clean single-GPU production option when you don’t need to fan out across many concurrent long-context requests.

4× L40S 48GB: the production default

192GB of aggregate VRAM gives you full-precision-ish quality via 8-bit weights plus generous KV-cache headroom for real concurrency and context. This is the Rivram Trail Boss configuration, and it’s where most teams serving 70B-class models in production land. Best balance of cost, memory, and throughput.

8× H100 SXM5: maximum throughput

When you need the highest concurrency, the longest contexts, or you’re serving multiple 70B models at once, the Rivram Titan custom build with NVLink-connected SXM5 GPUs is the answer. Overkill for a single moderate workload; exactly right at scale.

You can compare the raw specs across all of these on the hardware catalog.

Throughput, batching, and why one GPU isn’t the answer for everyone

Fitting the model in memory gets you to “it runs.” It doesn’t get you to “it serves your users.” Throughput is a separate question, and it’s driven by batching — how many requests the GPU processes at once.

Modern inference servers (vLLM, TGI, Triton) batch incoming requests together to keep the GPU’s compute units saturated. The catch: every request in a batch needs its own slice of KV cache. So your effective throughput is capped not by raw compute but by how much memory is left for the cache after the weights are loaded.

This is the quiet reason single-GPU 70B setups disappoint under load. The weights fit, but with little memory left for cache, the server can only batch a handful of requests — and throughput per dollar collapses the moment traffic arrives. Spreading the model across 4 GPUs frees up far more aggregate memory for batching, which is why the production default is a multi-GPU node even when a single card “technically works.”

If your traffic is bursty or concurrency is high, size for the batch you need to serve, not the smallest box the weights fit in.
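The memory cap on batching can be estimated with simple arithmetic. The sketch below rests on stated assumptions rather than measurements: about 90% of VRAM is usable by the server, each request holds an 8K-token context, and the cache is 16-bit on a Llama-style 70B (80 layers, 8 KV heads, head dimension 128). The function and constant names are illustrative, and real servers also reserve memory for activations and fragmentation.

```python
# Assumption: KV cache per 8K-token request on a Llama-style 70B with a
# 16-bit cache: 2 (K and V) * 80 layers * 8 KV heads * 128 dims * 2 bytes.
KV_GB_PER_8K_REQUEST = 2 * 80 * 8 * 128 * 2 * 8_192 / 1e9  # about 2.7 GB


def max_concurrent_requests(
    total_vram_gb: float,
    weights_gb: float,
    kv_gb_per_request: float = KV_GB_PER_8K_REQUEST,
    usable_fraction: float = 0.9,  # assumption: share of VRAM the server can use
) -> int:
    """Upper bound on requests whose KV cache fits alongside the weights."""
    free_gb = total_vram_gb * usable_fraction - weights_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // kv_gb_per_request)


if __name__ == "__main__":
    single = max_concurrent_requests(total_vram_gb=48, weights_gb=35)
    quad = max_concurrent_requests(total_vram_gb=4 * 48, weights_gb=70)
    print(f"1x 48GB, 4-bit weights: up to {single} concurrent 8K requests")
    print(f"4x 48GB, 8-bit weights: up to {quad} concurrent 8K requests")
```

With these inputs the single 48GB card tops out at roughly 3 concurrent 8K-token requests, while the four-GPU node has room for roughly 38. The exact figures will shift with your runtime and context lengths, but the gap is the point.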

Multi-GPU: when the interconnect starts to matter

Once you’re across multiple GPUs, how they talk to each other becomes a factor — though less than people assume for inference.

Within a single server (the common case), GPUs communicate over PCIe or, on SXM platforms, NVLink. For most 70B inference, standard PCIe between 4 L40S cards is entirely sufficient. NVLink’s 900GB/s GPU-to-GPU bandwidth pays off for the largest models and training, less so for typical inference.
Across multiple servers (multi-node), networking becomes critical and you’re into 100GbE/RoCE or InfiniBand territory. But you only get here when a single 70B model is too large for one box — which, with quantization, it usually isn’t.

The practical implication: for a 70B model, you almost never need to leave a single server, and you rarely need NVLink-class interconnect. A 4× L40S node on PCIe is the un-glamorous, correct answer for most teams. We cover the networking decision in more depth in How to Plan Your First GPU Deployment.

The mistake worth repeating: scale-out often beats scale-up

Teams instinctively reach for the biggest GPU to “future-proof.” For 70B inference, that’s frequently the wrong call. Four L40S nodes can out-serve one H100 node on throughput-per-dollar for many workloads, because inference parallelizes well and you’re rarely bottlenecked on the things SXM5 and NVLink are built for.

Buy for your actual workload — your real context length, your real concurrency, your real quality bar — not the largest model you might theoretically run someday. We dig into this tradeoff further in How to Plan Your First GPU Deployment.

Quick reference

Your situation	Configuration
Internal tool, low volume, cost-sensitive	1× 48GB (L40S), 4-bit
Single-tenant production, moderate context	1× 80GB (H100), 8-bit
Multi-tenant production, real concurrency	4× L40S, 8-bit
High throughput, long context, multiple models	8× H100 SXM5

Get the sizing right before you buy

The wrong 70B configuration costs you twice — once when you overpay for GPUs you didn’t need, and again when you under-provision the KV cache and your serving falls over under load. Both are avoidable with an honest read of your workload.

If you can describe your model, your context length, and your expected concurrency, we can size the hardware precisely. That’s the core of our planning service — reach out and we’ll spec the right configuration for what you’re actually running, not the spec sheet’s idea of it.