You’ve decided to stop paying cloud GPU bills and own your own inference infrastructure. Good call — at scale, the economics are compelling. But the path from “we should do this” to “our rack is live” is more complicated than most teams expect.
This is the planning guide we wish existed when we were helping our first clients navigate this. Here’s how to think through it.
Step 1: Define Your Workloads Precisely
Vague workload definitions lead to wrong hardware. Before you look at a single server spec sheet, you need to answer:
What model(s) are you running?
- LLM inference (Llama 3, Mixtral, GPT-4-class)? What parameter count?
- Image generation (Stable Diffusion, FLUX)?
- Video generation?
- Custom-trained models?
What are your latency requirements?
- Interactive use (< 500ms time-to-first-token) requires a different hardware configuration than batch processing, where throughput matters more than latency
- What’s your acceptable p99 latency?
What’s your throughput target?
- Requests per second? Tokens per second?
- What’s your current volume vs. projected volume in 12 months?
What’s your context length?
- Running 8K context or 128K context dramatically changes memory requirements
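- A 70B-parameter model in FP16 needs ~140GB of VRAM for weights alone. At 128K context, the KV cache adds roughly 40GB per concurrent sequence for a model with grouped-query attention (such as Llama 3 70B), and far more without it. That already exceeds the 160GB of 2x H100 SXM5 unless you quantize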
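The memory arithmetic behind the context-length question can be sketched in a few lines. This is a rough estimate, not a sizing tool: it assumes FP16 weights and cache (2 bytes per value) and a Llama-3-70B-style shape (80 layers, 8 KV heads, head dimension 128). The function names are illustrative, and real serving stacks add overhead for activations and memory fragmentation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


def weight_bytes(num_params, bytes_per_param=2):
    """Memory for model weights at the given precision (2 bytes = FP16/BF16)."""
    return num_params * bytes_per_param


GB = 1e9

# Assumed model shape (Llama-3-70B-style, grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

weights_gb = weight_bytes(70e9) / GB
kv_8k_gb = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 8 * 1024) / GB
kv_128k_gb = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 128 * 1024) / GB

print(f"Weights (FP16):       {weights_gb:.1f} GB")
print(f"KV cache @ 8K, x1:    {kv_8k_gb:.1f} GB")
print(f"KV cache @ 128K, x1:  {kv_128k_gb:.1f} GB")
print(f"Total @ 128K, x1:     {weights_gb + kv_128k_gb:.1f} GB")
```

Raising `batch_size` shows how quickly concurrent long-context requests multiply the cache, which is usually what drives the GPU count.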
Getting precise on these answers before hardware selection saves you from buying the wrong thing.
Step 2: Right-Size Your GPU Selection
The GPU selection flows directly from your workloads. Here’s a rough guide:
NVIDIA H100 SXM5 (80GB HBM3)
- Best for: Large LLM inference (70B+ models), training, high-throughput batch workloads
- When to choose it: You need maximum memory bandwidth and NVLink bandwidth for multi-GPU inference
NVIDIA A100 (80GB HBM2e)
- Best for: Production LLM inference, slightly older models, cost-sensitive deployments
- When to choose it: H100s are out of budget or lead time is too long; still excellent for most production workloads
NVIDIA L40S (48GB GDDR6)
- Best for: Smaller models (7B–13B), image generation, video inference, mixed workloads
- When to choose it: You don't need the raw memory bandwidth of HBM, or you have mixed inference types
A common mistake: Teams overbuy because the largest GPU “future-proofs” them. In practice, you may be better off with more L40S nodes than fewer H100 nodes — depending on your workloads, the throughput and cost-per-request math often favors scale-out over scale-up.
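One way to run the scale-out versus scale-up math is to compare node count, monthly cost, and cost per million tokens at your target throughput. The sketch below is a simplified model: every number in it is a placeholder rather than a benchmark or a price quote, and the names (`NodeOption`, `monthly_cost_per_node`, and so on) are made up for illustration. Substitute throughput you have measured on your own workload and pricing from real quotes.

```python
import math
from dataclasses import dataclass

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, for simplicity


@dataclass
class NodeOption:
    name: str
    node_cost_usd: float   # purchase price per node (placeholder)
    tokens_per_sec: float  # sustained throughput per node, from your own benchmark
    power_kw: float        # sustained power draw per node


def monthly_cost_per_node(opt, amortization_months=36, power_usd_per_kw_month=150.0):
    """Straight-line hardware amortization plus colo power charges."""
    return opt.node_cost_usd / amortization_months + opt.power_kw * power_usd_per_kw_month


def nodes_needed(opt, target_tokens_per_sec):
    """Whole nodes required to reach the target throughput."""
    return math.ceil(target_tokens_per_sec / opt.tokens_per_sec)


def cost_per_million_tokens(opt, utilization=0.5, **cost_kwargs):
    """Cost per 1M tokens at a given average utilization."""
    tokens_per_month = opt.tokens_per_sec * utilization * SECONDS_PER_MONTH
    return monthly_cost_per_node(opt, **cost_kwargs) / tokens_per_month * 1e6


# Placeholder inputs only: NOT real prices or measured throughput.
options = [
    NodeOption("scale-up option (placeholder)", 300_000, 20_000, 11.0),
    NodeOption("scale-out option (placeholder)", 100_000, 8_000, 6.0),
]
target = 30_000  # tokens/sec you need to serve

for opt in options:
    n = nodes_needed(opt, target)
    total = n * monthly_cost_per_node(opt)
    print(f"{opt.name}: {n} nodes, ${total:,.0f}/month, "
          f"${cost_per_million_tokens(opt):.3f} per 1M tokens")
```

Because nodes come in whole units, the option with the lower cost per token is not always the cheaper one at a specific throughput target, which is why it helps to look at both figures.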
Step 3: Calculate Your Power Requirements
This is the step most first-timers skip — and it’s the one that causes the most problems at deployment time.
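Rule of thumb: A fully loaded GPU server draws roughly 2x the combined TDP of its GPUs under sustained load, sometimes more, once you add CPUs, memory, NICs, fans, and power-supply losses.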
- 8x H100 SXM5: GPU TDP = 700W × 8 = 5,600W in GPUs. Full server power draw: 10,000–12,000W
- 8x A100: GPU TDP = 400W × 8 = 3,200W. Full server draw: 6,500–8,500W
- 8x L40S: GPU TDP = 350W × 8 = 2,800W. Full server draw: 5,500–7,000W
What this means for your colocation order: When you request power from a colo facility, they’ll ask for your power commitment in kW. Don’t underestimate this.
For 2 servers of 8x H100s: budget for 25–30kW of power. That includes server draw plus overhead for cooling, networking, and some buffer.
Tip: Facilities typically charge a higher per-kW rate for usage above your committed power, so commit slightly above your estimate rather than below it.
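The power arithmetic above can be sketched as a small calculator. It uses the GPU TDP figures from this section. The 2.0 system multiplier and the 20% headroom are assumptions to adjust against your vendor's measured power numbers, and the function names are illustrative.

```python
# GPU TDP figures from this section, in watts.
GPU_TDP_W = {"H100 SXM5": 700, "A100": 400, "L40S": 350}


def server_draw_watts(gpu_tdp_w, gpus_per_server=8, system_multiplier=2.0):
    """Estimated full-server draw: combined GPU TDP times a system multiplier
    covering CPUs, memory, NICs, fans, and PSU losses (assumed value)."""
    return gpu_tdp_w * gpus_per_server * system_multiplier


def power_commitment_kw(server_watts, num_servers, headroom=0.2):
    """Power to request from the facility, with headroom for networking and buffer."""
    return server_watts * num_servers * (1 + headroom) / 1000


for gpu, tdp in GPU_TDP_W.items():
    per_server = server_draw_watts(tdp)
    commit = power_commitment_kw(per_server, num_servers=2)
    print(f"2 servers of 8x {gpu}: ~{per_server / 1000:.1f} kW each, "
          f"request ~{commit:.1f} kW")
```

With these assumptions, two 8x H100 servers come out at about 26.9 kW, inside the 25–30kW range suggested above.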
Step 4: Networking Architecture
Small deployments (1–4 GPU servers) can usually run on standard 25GbE or 100GbE Ethernet. Single-node inference doesn’t require special interconnects.
Once you’re doing multi-node inference — distributing a single model across multiple servers — networking becomes critical:
- InfiniBand NDR (400Gb/s): The highest-performance option for distributed inference. Worth it when cross-server communication is your bottleneck, though even at 400Gb/s (about 50GB/s per port) it is far slower than NVLink inside a server. Expensive.
- 100GbE/400GbE Ethernet: Sufficient for most distributed workloads and much more cost-effective than InfiniBand for inference (vs. training).
- RDMA over Converged Ethernet (RoCE): Gets you high-performance networking over Ethernet infrastructure. Good middle ground.
For most Austin startups doing LLM inference, 100GbE Ethernet with RoCE is the right starting point. InfiniBand is worth it when you’re scaling to very large models across many nodes.
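To get a feel for when link speed starts to matter, you can estimate the bandwidth-only cost of one all-reduce across nodes. This is a back-of-the-envelope sketch. It assumes a ring all-reduce, where each node sends about 2(N-1)/N times the payload, and it ignores latency, protocol overhead, and whatever algorithm your communication library actually picks. The `efficiency` factor and the example payload are assumptions.

```python
def link_gbytes_per_sec(link_gbps):
    """Convert a link rate in gigabits/s to gigabytes/s."""
    return link_gbps / 8


def ring_allreduce_seconds(payload_bytes, num_nodes, link_gbps, efficiency=0.8):
    """Bandwidth-only lower bound for one ring all-reduce across nodes."""
    if num_nodes < 2:
        return 0.0  # nothing to exchange on a single node
    traffic = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    usable_bytes_per_sec = link_gbytes_per_sec(link_gbps) * 1e9 * efficiency
    return traffic / usable_bytes_per_sec


# Example payload (assumed): activations for 4096 tokens, hidden size 8192, FP16.
payload = 4096 * 8192 * 2

for name, gbps in [("25GbE", 25), ("100GbE", 100), ("400Gb/s (NDR or 400GbE)", 400)]:
    ms = ring_allreduce_seconds(payload, num_nodes=2, link_gbps=gbps) * 1000
    print(f"{name}: ~{ms:.2f} ms per all-reduce")
```

Multiply the result by the number of all-reduces per forward pass in your parallelism scheme and compare it with your latency budget. If the total is a small fraction of that budget, Ethernet with RoCE is likely enough.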
Step 5: Choose Your Colocation Facility
Not all colo facilities are equal for AI workloads. Things to evaluate:
Power density support: Can the facility support 15–20kW per cabinet? Many older facilities are designed for 2–5kW/cabinet average. AI racks need much higher density.
Cooling infrastructure: High-density AI racks require hot-aisle/cold-aisle containment at minimum. Liquid cooling (direct-to-chip or rear-door heat exchangers) is increasingly important for the densest deployments.
Fiber and connectivity: What carriers are on-net? What’s the cost per cross-connect? What’s the total bandwidth available to your cabinet?
Location: Proximity to your team matters for hands-on support. In Austin, there are several excellent options within the metro area.
A note on relationships: If you’re working with a system integrator (like Rivram), they likely have existing relationships with local facilities. That can mean faster provisioning, better pricing, and smoother escalation paths when there are issues.
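To make the power density question concrete, here is a minimal sketch of how a facility's per-cabinet power limit translates into cabinet count. The 11 kW server figure is an assumption taken from the 8x H100 estimate earlier in this guide, and the function names are illustrative.

```python
import math


def servers_per_cabinet(cabinet_kw, server_kw):
    """How many servers fit within a cabinet's power limit."""
    return math.floor(cabinet_kw / server_kw)


def cabinets_needed(num_servers, cabinet_kw, server_kw):
    """Cabinets required, or an error if one server exceeds the cabinet limit."""
    per_cabinet = servers_per_cabinet(cabinet_kw, server_kw)
    if per_cabinet == 0:
        raise ValueError(
            f"A {server_kw} kW server does not fit in a {cabinet_kw} kW cabinet"
        )
    return math.ceil(num_servers / per_cabinet)


SERVER_KW = 11.0  # assumed draw of one 8x H100 server

for cabinet_kw in (5, 15, 20, 30):
    try:
        n = cabinets_needed(4, cabinet_kw, SERVER_KW)
        print(f"{cabinet_kw} kW cabinets: {n} needed for 4 servers")
    except ValueError as err:
        print(f"{cabinet_kw} kW cabinets: not viable ({err})")
```

A facility built for 2–5kW cabinets cannot host even one such server per cabinet, and the step from 20kW to 30kW per cabinet can halve your footprint.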
The Mistakes We See Most Often
Buying hardware before validating power commitments. Lead times on H100s can be 12–20 weeks. Power provisioning at a colo facility can take 4–8 weeks. Start both in parallel — don’t wait for hardware to arrive to start the power conversation.
Underestimating network requirements. “We’ll start with 1GbE and upgrade later” almost always costs more than getting it right initially.
No hardware management plan. Who monitors the hardware for failures? Who handles a failed GPU at 3am? If the answer is “we’ll figure it out,” that’s a problem waiting to happen.
Buying the wrong storage. LLM inference requires fast model weight loading. A 70B model in FP16 is roughly 140GB of weights, and a restart or model swap should take seconds, not minutes. Plan for NVMe-backed storage.
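The storage point is easy to check with rough arithmetic: load time is about the weight size divided by sustained read throughput. The throughput figures below are approximate assumptions (1GbE line rate, a typical SATA SSD, a typical PCIe Gen4 NVMe drive), not measurements, so benchmark your own storage before committing.

```python
def load_seconds(weights_gb, read_gb_per_sec):
    """Best-case time to read model weights from storage."""
    return weights_gb / read_gb_per_sec


# Approximate sustained read throughput in GB/s (assumed values).
STORAGE_GB_PER_SEC = {
    "network share over 1GbE": 0.125,
    "SATA SSD": 0.5,
    "single NVMe (PCIe Gen4)": 7.0,
}

weights_gb = 140  # roughly a 70B-parameter model in FP16

for name, rate in STORAGE_GB_PER_SEC.items():
    secs = load_seconds(weights_gb, rate)
    print(f"{name}: ~{secs:,.0f} s to load {weights_gb} GB")
```

This is a lower bound. Deserialization and the host-to-GPU copy add time on top, but the gap between storage tiers is large enough to show up in every restart and model swap.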
What a Good Deployment Plan Looks Like
Before you commit to any hardware purchases, you should have:
- Defined workloads with specific latency and throughput targets
- GPU selection validated against those workloads
- Power calculations completed and power reserved at your target facility
- Networking architecture designed
- Hardware BOM finalized with sourcing timeline
- Deployment and management plan in place
If you’ve got most of these but not all, talk to us. The planning conversation is free — and it’s usually the most valuable 30 minutes before a big infrastructure decision.