Capacity Planning Scenarios
vllm-sr-sim answers fleet-planning questions that cannot be resolved
from first principles alone: where to set a split threshold, whether a fleet
will actually meet SLO under real queue dynamics, which GPU type is cheapest
for a given workload, and when to pre-provision the next tier.
GPU unit costs used throughout:
| GPU | $/hr | $/yr |
|---|---|---|
| A10G 24 GB | $1.01 | $8.85 K |
| A100 80 GB | $2.21 | $19.4 K |
| H100 80 GB | $4.02 | $35.2 K |
P99 TTFT = P99(KV-slot queue wait) + mean prefill time. Each KV-cache slot is modelled as a server in an M/G/c queue.
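The decomposition above can be sketched with the closed-form M/M/c (Erlang C) waiting-time tail. This is a simplification: it assumes exponential service times, whereas the text describes an M/G/c model, and the function names and parameters are illustrative rather than part of vllm-sr-sim.

```python
import math


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arrival has to queue in an M/M/c system."""
    if offered_load >= servers:
        return 1.0  # unstable: every arrival waits
    blocking = 1.0  # Erlang B with zero servers
    for k in range(1, servers + 1):
        # Standard Erlang B recursion, numerically stable for large c
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / servers
    return blocking / (1.0 - rho * (1.0 - blocking))


def p99_ttft_ms(lam: float, mean_service_s: float, slots: int,
                mean_prefill_ms: float) -> float:
    """P99 queue wait (M/M/c assumption) plus mean prefill time, in ms."""
    offered_load = lam * mean_service_s
    if offered_load >= slots:
        return math.inf  # queue grows without bound
    p_wait = erlang_c(slots, offered_load)
    if p_wait <= 0.01:
        wait_s = 0.0  # fewer than 1% of arrivals queue at all
    else:
        # P(W > t) = p_wait * exp(-(c*mu - lam) * t); solve for P = 0.01
        drain_rate = slots / mean_service_s - lam
        wait_s = math.log(p_wait / 0.01) / drain_rate
    return wait_s * 1000.0 + mean_prefill_ms
```

With many free slots the wait term vanishes and P99 TTFT collapses to the prefill time, which is why prefill cost can dominate the SLO budget on its own.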
When to split pools — the short version
Before reaching for the simulator, apply this filter:
Heavy-tail service times (agent / long-context)?
→ Split required. Homo cannot meet SLO regardless of GPU count.
ctx ratio R = long_max_ctx / B_short and long-request fraction f:
R ≤ 2× or f > 30% → homo usually cheaper; split for latency isolation only
R ≥ 4× and f < 10% → split cheaper at high traffic (λ > ~100 req/s)
R ≥ 16× and f < 5% → split cheaper at any meaningful traffic level
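The filter above can be written as a small function. This is a sketch only: the function name and return strings are made up for illustration, and any case the rules do not cover is reported as needing simulation.

```python
def split_advice(ctx_ratio: float, long_fraction: float, lam: float,
                 heavy_tail: bool = False) -> str:
    """Apply the rule-of-thumb filter; thresholds are copied from the rules above."""
    if heavy_tail:
        return "split required"
    if ctx_ratio <= 2 or long_fraction > 0.30:
        return "homo usually cheaper"
    if ctx_ratio >= 16 and long_fraction < 0.05:
        return "split cheaper"
    if ctx_ratio >= 4 and long_fraction < 0.10:
        # The rule only claims a saving at high traffic (roughly 100 req/s and up)
        return "split cheaper" if lam > 100 else "simulate"
    return "simulate"  # the rules are silent here


# LMSYS at B_short=4096: R = 65536 / 4096 = 16, about 1.6% of requests are long
print(split_advice(ctx_ratio=16, long_fraction=0.016, lam=100))
# Azure: R = 8192 / 4096 = 2 (long fraction here is a placeholder value)
print(split_advice(ctx_ratio=2, long_fraction=0.05, lam=200))
```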
Everything below is a puzzle the rule of thumb cannot solve on its own.
Puzzle 1 — Where exactly should I split?
Question: the rule says "split" — but at which token threshold?
The optimal threshold depends entirely on the shape of your CDF. Too low and
the long pool handles too much traffic; too high and the short pool's slot
advantage evaporates. The pareto command sweeps every CDF breakpoint and
finds the cost–latency frontier.
vllm-sr-sim pareto \
--cdf data/lmsys_cdf.json --lam 100 --slo 500 --long-max-ctx 65536
LMSYS result (λ=100, A100, homo baseline = $271K / 14 GPUs):
B_short α-short n_s n_l GPUs $/yr saving P99-s P99-l SLO Pareto
---------------------------------------------------------------------------------------
512 63.8% 2 13 15 $ 290K -7.1% 9ms 25ms ✓ ★
1,024 83.1% 2 10 12 $ 232K +14.3% 10ms 36ms ✓ ★
2,048 94.8% 2 7 9 $ 174K +35.7% 12ms 63ms ✓ ★
4,096 98.4% 3 5 8 $ 155K +42.9% 13ms 108ms ✓ ★ ← optimal
8,192 99.7% 4 4 8 $ 155K +42.9% 14ms 212ms ✓ ★
12,288 99.9% 5 3 8 $ 155K +42.9% 14ms 319ms ✓ ★
Insight: B_short=4096 is optimal. 98% of LMSYS traffic fits below it, so the short pool (256 KV slots at max_ctx=4096) is 16× more slot-efficient than the homo pool (16 slots at max_ctx=65536). Result: 14 GPUs → 8 GPUs, −43% cost. Choosing B_short=512 instead would route only 64% of traffic short, leaving the remaining 36% in the expensive long pool and costing 7% more than homo.
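The slot arithmetic behind this insight is short enough to check by hand. The sketch below assumes KV slots scale inversely with max_ctx and infers the A100 token budget from the "16 slots at max_ctx=65536" figure; neither the constant nor the helper names come from the tool.

```python
# Assumption: per-GPU KV token budget inferred from 16 slots at max_ctx=65536
A100_TOKEN_BUDGET = 16 * 65536


def kv_slots(token_budget: int, max_ctx: int) -> int:
    """Concurrent KV-cache slots a GPU can hold at a given max context."""
    return token_budget // max_ctx


def gpu_saving(homo_gpus: int, split_gpus: int) -> float:
    """Fractional saving vs homo; equals the cost saving when all GPUs are one type."""
    return 1.0 - split_gpus / homo_gpus


print(kv_slots(A100_TOKEN_BUDGET, 65536))   # homo pool slots per GPU
print(kv_slots(A100_TOKEN_BUDGET, 4096))    # short pool slots per GPU
print(f"{gpu_saving(14, 8):+.1%}")          # B_short=4096 row
print(f"{gpu_saving(14, 15):+.1%}")         # B_short=512 row
```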
Azure result (λ=200, A100, homo baseline = $465K / 24 GPUs):
The entire Azure CDF fits within 8192 tokens, so the max ctx ratio is only 2×. Best Pareto point (B_short=3072) saves just 4% — the slot gain is too small to overcome Erlang fragmentation. The value here is latency isolation: the short pool's P99 drops from 26 ms (homo) to 19 ms, useful for tiered SLAs.
Agent result (λ=200, A100, homo baseline = $9293K / 480 GPUs):
B_short saving P99-s P99-l SLO
16,384 +13.3% 69ms 339ms ✓ ← optimal
32,768 +12.9% 86ms 593ms ✗ ← SLO FAIL: long-pool prefill dominates
B_short=16384 (64 KV slots vs 16 for homo) saves 64 GPUs. At B_short=32768 and above, the long pool receives requests with 300–600 ms prefill times that exhaust the 500 ms SLO budget on their own. The P99 failure is caused by prefill cost, not queue wait, which is only visible in a full simulation.
Puzzle 2 — Why is my agent fleet failing SLO?
Question: 24 H100 GPUs at λ=20 req/s run at only ~30% utilisation. The analytical model says the fleet is fine. The discrete-event simulation (DES) says it is not.
# Homo baseline — looks feasible analytically
vllm-sr-sim optimize \
--cdf data/agent_heavy_cdf.json --lam 20 --slo 1000 \
--gpu-short h100 --gpu-long h100 \
--b-short 65536 --long-max-ctx 65536
# Two-pool — the fix
vllm-sr-sim optimize \
--cdf data/agent_heavy_cdf.json --lam 20 --slo 1000 \
--gpu-short h100 --gpu-long h100 \
--b-short 4096 --long-max-ctx 65536 --verify-top 3
| Config | GPUs | Cost/yr | P99 TTFT | SLO 1000ms |
|---|---|---|---|---|
| Homo 65K ctx | 24 | $845 K | 1 052 ms | ✗ FAIL |
| Two-pool 4K/65K | 25 | $880 K | 17ms / 147ms | ✓ |
Why analytics miss this: the M/G/c model assumes service times are drawn i.i.d. from a distribution with finite variance. Agent requests span 10–300 seconds of service time, giving a squared coefficient of variation cv² ≫ 1. A single long request holds a KV slot for minutes, causing other requests to queue behind it even when GPU utilisation appears low. The DES replays the actual arrival sequence and exposes these spikes; Erlang-C does not.
Two-pool solves it by routing the 46% of long requests (>4K tokens) to a dedicated pool where their slow service time cannot block short requests. Cost premium: +4% — essentially free insurance against SLO failure.
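To see how service-time variability inflates queueing delay, the sketch below computes cv² for a sample and applies the Allen-Cunneen approximation, which scales the M/M/c mean wait by (1 + cv²)/2. The sample is hypothetical (95 requests at 10 s, 5 at 300 s), the approximation covers mean wait rather than P99, and nothing here is claimed to be what vllm-sr-sim computes internally.

```python
from statistics import mean, pvariance


def squared_cv(service_times_s):
    """Squared coefficient of variation: variance / mean^2."""
    m = mean(service_times_s)
    return pvariance(service_times_s, m) / (m * m)


def allen_cunneen_inflation(cv2: float) -> float:
    """Factor by which mean M/G/c wait exceeds mean M/M/c wait (approximation)."""
    return (1.0 + cv2) / 2.0


# Hypothetical agent-like mix: mostly 10 s requests plus a few 300 s ones
agent_like = [10.0] * 95 + [300.0] * 5
cv2 = squared_cv(agent_like)
print(f"cv^2 = {cv2:.2f}, mean-wait inflation = {allen_cunneen_inflation(cv2):.2f}x")
```

Exponential service times give cv² = 1 and an inflation factor of exactly 1, so any factor above that is delay an Erlang-C estimate leaves out.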
Puzzle 3 — Which GPU type is actually cheapest?
Question: A10G is cheap per card but slow. H100 is expensive per card but fast. Which is cheapest for a given workload?
for gpu in a10g a100 h100; do
# homo baseline
vllm-sr-sim optimize --cdf data/azure_cdf.json --lam 100 --slo 500 \
--gpu-short $gpu --gpu-long $gpu --b-short 8192 --long-max-ctx 8192
# two-pool
vllm-sr-sim optimize --cdf data/azure_cdf.json --lam 100 --slo 500 \
--gpu-short $gpu --gpu-long $gpu --long-max-ctx 8192 --verify-top 3
done
Result (Azure, λ=100, SLO=500ms):
| GPU | Layout | GPUs | Cost/yr | P99 TTFT |
|---|---|---|---|---|
| A10G | Two-pool | 19 | $168 K ← cheapest | 155ms / 335ms |
| H100 | Homo | 6 | $211 K | 26 ms |
| A100 | Two-pool | 12 | $232 K | 52ms / 112ms |
| H100 | Two-pool | 7 | $247 K | 13ms / 30ms |
The non-obvious result: A10G two-pool ($168K) beats H100 homo ($211K). A slower GPU wins by using two-pool routing to compensate. This happens because the Azure ctx ratio (8192/4096 = 2×) doubles A10G's KV slots from 64 to 128 per GPU, enough to offset its lower throughput.
The decision depends on your constraint:
| Priority | Choice |
|---|---|
| Minimum cost | A10G two-pool ($168K) |
| Minimum rack space / power | H100 homo (6 GPUs) |
| Best latency | H100 two-pool (13ms P99 short) |
| Long-context / agent | H100 or A100 — A10G's 24GB VRAM limits KV cache |
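The cost column can be reproduced from the hourly rates in the unit-cost table. The sketch assumes 8,760 billed hours per year, which matches the $/yr figures given earlier; the helper name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: 365 days, billed continuously
HOURLY_USD = {"a10g": 1.01, "a100": 2.21, "h100": 4.02}


def annual_cost_k(gpu: str, count: int) -> float:
    """Annual fleet cost in thousands of dollars."""
    return HOURLY_USD[gpu] * HOURS_PER_YEAR * count / 1000.0


# (label, GPU type, GPU count) taken from the result table above
fleets = [
    ("A10G two-pool", "a10g", 19),
    ("H100 homo", "h100", 6),
    ("A100 two-pool", "a100", 12),
    ("H100 two-pool", "h100", 7),
]

ranked = sorted(
    ((label, annual_cost_k(gpu, n)) for label, gpu, n in fleets),
    key=lambda item: item[1],
)
for label, cost in ranked:
    print(f"{label:<14} ${cost:,.0f}K")
```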
Puzzle 4 — When do I need to add GPUs?
Question: traffic is growing — at which exact λ do I need to provision the next GPU tier to avoid a reactive SLO violation?
vllm-sr-sim whatif \
--cdf data/azure_cdf.json --slo 500 \
--gpu-short h100 --gpu-long h100 --long-max-ctx 8192 \
--lam-range 25 50 100 150 200 300 400
GPU step thresholds (two-pool, H100):
| λ (req/s) | GPUs | Cost/yr | Add GPUs before reaching… |
|---|---|---|---|
| 25 | 4 | $141 K | λ = 65 |
| 50 | 5 | $176 K | λ = 90 |
| 100 | 7 | $247 K | λ = 130 |
| 150 | 10 | $352 K | λ = 185 |
| 200 | 12 | $423 K | λ = 270 |
| 300 | 18 | $634 K | λ = 370 |
| 400 | 23 | $810 K | — |
Insight: GPU scaling is sub-linear — traffic grows 16× (25→400) but GPUs grow only 5.75× (4→23). Use this table to pre-provision before the step, not after. Waiting until SLO is already violated means at least one traffic tier with degraded P99.
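The sub-linear scaling claim can be checked directly from the table. The lookup helper below is a conservative reading, not the tool's logic: it sizes for the next tabulated λ at or above current traffic.

```python
import bisect

# (λ in req/s, GPUs) rows copied from the two-pool H100 table above
STEPS = [(25, 4), (50, 5), (100, 7), (150, 10), (200, 12), (300, 18), (400, 23)]
LAMBDAS = [lam for lam, _ in STEPS]

traffic_growth = STEPS[-1][0] / STEPS[0][0]
gpu_growth = STEPS[-1][1] / STEPS[0][1]


def gpus_for(lam: float) -> int:
    """GPU count of the first tabulated row whose λ is >= the given traffic."""
    idx = bisect.bisect_left(LAMBDAS, lam)
    if idx == len(STEPS):
        raise ValueError("traffic exceeds the tabulated range; rerun the sweep")
    return STEPS[idx][1]


print(f"traffic x{traffic_growth:g}, GPUs x{gpu_growth:g}")
for lam, gpus in STEPS:
    print(f"lam={lam:>3}: {lam / gpus:5.2f} req/s per GPU")
```

Throughput per GPU rises from 6.25 req/s at λ=25 to about 17.4 req/s at λ=400, which is the pooling effect the insight describes.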
Puzzle 5 — Which router causes SLO violations?
Question: you have sized the fleet correctly. Does the router matter?
# Agent fleet — where the choice is consequential
vllm-sr-sim compare-routers \
--cdf data/agent_heavy_cdf.json --lam 20 --slo 1000 \
--gpu-short h100 --gpu-long h100 \
--n-s 2 --n-l 23 --long-max-ctx 65536 --n-req 5000
Agent fleet (λ=20, n_s=2, n_l=23):
| Router | P99 TTFT | SLO 1000ms |
|---|---|---|
| LengthRouter | 495 ms | ✓ 99.98% |
| CompressAndRoute | 534 ms | ✗ 99.94% |
| RandomRouter | 292 ms | ✓ 100% |
Two surprises:
- CompressAndRoute violates SLO despite being designed to reduce fleet size. It compresses borderline-length requests and routes them to the short pool; when several arrive together they overwhelm the 2-GPU short pool and spike P99. It is a planning tool: use it to discover a lower GPU count at sizing time, then deploy LengthRouter for production.
- RandomRouter passes SLO at this fleet size because it spreads load across all 25 GPUs uniformly, diluting the heavy-tail requests across more KV slots. Its P99 is actually lowest (292 ms), but this is fragile: short requests share slots with long ones, so a traffic-mix shift can cause unpredictable latency degradation.
For chatbot workloads (Azure, low utilisation) all three routers pass SLO — the difference only matters for agent or near-saturation fleets.
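The isolation difference between length-based and random routing can be shown with a toy model. These functions are hypothetical stand-ins, not the tool's LengthRouter or RandomRouter classes; the `<=` boundary at B_short and the uniform choice within a pool are assumptions.

```python
import random


def route_by_length(prompt_tokens: int, b_short: int) -> str:
    """Toy length-based rule: requests up to b_short go to the short pool."""
    return "short" if prompt_tokens <= b_short else "long"


def gpus_touched_by_long(requests, b_short, n_short, n_long, policy, seed=0):
    """Return the set of GPU indices that receive at least one long request."""
    rng = random.Random(seed)
    touched = set()
    for tokens in requests:
        if policy == "length":
            # Short pool is indices [0, n_short); long pool follows it
            if route_by_length(tokens, b_short) == "short":
                gpu = rng.randrange(n_short)
            else:
                gpu = n_short + rng.randrange(n_long)
        else:
            gpu = rng.randrange(n_short + n_long)  # uniform over the whole fleet
        if tokens > b_short:
            touched.add(gpu)
    return touched


# Hypothetical trace: alternating short (100 tok) and long (50,000 tok) requests
trace = [100, 50_000] * 400
isolated = gpus_touched_by_long(trace, 4096, 2, 23, "length")
mixed = gpus_touched_by_long(trace, 4096, 2, 23, "random")
print(min(isolated) >= 2, len(mixed))
```

Under the length rule no long request ever lands on the two short-pool GPUs; under random routing every GPU eventually serves long requests, which is the fragility noted above.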
Puzzle 6 — Does mixing GPU types in the two-pool fleet save money?
Question: short requests are memory-bandwidth bound and cheap to serve; long requests need large KV caches and fast prefill. Can I cut costs by putting cheap GPUs in the short pool and premium GPUs only where they are needed?
# Azure: A10G short + H100/A100 long
vllm-sr-sim optimize \
--cdf data/azure_cdf.json --lam 100 --slo 500 \
--gpu-short a10g --gpu-long h100 --long-max-ctx 8192
# LMSYS 65K-ctx: test whether A100 can meet SLO on the long pool
vllm-sr-sim optimize \
--cdf data/lmsys_cdf.json --lam 100 --slo 500 \
--gpu-short a10g --gpu-long h100 --long-max-ctx 65536
vllm-sr-sim optimize \
--cdf data/lmsys_cdf.json --lam 100 --slo 500 \
--gpu-short a10g --gpu-long a100 --long-max-ctx 65536
Azure λ=100 result:
| Config | GPUs | Cost/yr | P99 short | P99 long |
|---|---|---|---|---|
| All-A100 (baseline) | 12 | $232 K | 52 ms | 112 ms |
| A10G short + H100 long | 12 | $212 K | 155 ms | 30 ms |
| A10G short + A100 long | 15 | $206 K | 155 ms | 112 ms |
LMSYS λ=100 result (max_ctx=65536):
| Config | GPUs | Cost/yr | P99 short | P99 long |
|---|---|---|---|---|
| All-A100 (baseline) | 8 | $155 K | 43 ms | 2 822 ms ✗ |
| A10G short + H100 long | 7 | $141 K | 129 ms | 181 ms ✓ |
| A10G short + A100 long | 9 | $132 K | 129 ms | 2 822 ms ✗ |
Two insights:
- Azure (short-context): A10G+H100 saves 9% vs all-A100 with the same 12 GPUs. Expensive GPUs land only in the long pool, where context length justifies them; cheap A10Gs handle the 98% short traffic.
- LMSYS (long-context): A10G+A100 is the wrong pairing. It is the cheapest option on paper ($132 K vs $155 K for all-A100), but the A100 long pool cannot meet the 500 ms SLO. For requests up to 65536 tokens the A100 prefill takes ~700–2800 ms; H100 halves that with its larger chunk size (1024 vs 512) and lower W. H100 is not a luxury here: it is a correctness requirement for the long pool. Mixing A10G (short) + H100 (long) saves 9% vs all-A100 and fixes the SLO that all-A100 cannot meet.
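The tables report only total GPU counts for the mixed fleets. The sketch below backs out the short/long split that best matches each reported cost, using the $/yr figures from the unit-cost table. The resulting splits are inferences from the totals, not values stated by the tool.

```python
# $K per GPU-year, from the unit-cost table
COST_K = {"a10g": 8.85, "a100": 19.4, "h100": 35.2}


def implied_split(total_gpus: int, reported_cost_k: float,
                  short_gpu: str, long_gpu: str) -> tuple[int, int]:
    """Find (n_short, n_long) whose cost is closest to the reported total."""
    def error(n_short: int) -> float:
        cost = n_short * COST_K[short_gpu] + (total_gpus - n_short) * COST_K[long_gpu]
        return abs(cost - reported_cost_k)

    n_short = min(range(1, total_gpus), key=error)  # at least one GPU per pool
    return n_short, total_gpus - n_short


print(implied_split(12, 212, "a10g", "h100"))  # Azure, A10G short + H100 long
print(implied_split(7, 141, "a10g", "h100"))   # LMSYS, A10G short + H100 long
print(f"Azure saving vs all-A100: {1 - 212 / 232:.0%}")
print(f"LMSYS saving vs all-A100: {1 - 141 / 155:.0%}")
```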
Puzzle 7 — When should I switch to disaggregated prefill/decode?
Question: prefill is compute-bound (FLOP-intensive); decode is memory-bandwidth-bound (weight-streaming per token). If I run them on separate pools sized independently, which pairing minimises cost?
# Find optimal nP × nD ratio for H100 prefill + A100 decode
vllm-sr-sim disagg \
--cdf data/azure_cdf.json --lam 100 \
--slo-ttft 500 --slo-tpot 100 \
--gpu-prefill h100 --gpu-decode a100 --max-ctx 8192
# Compare: A100 prefill + H100 decode
vllm-sr-sim disagg \
--cdf data/azure_cdf.json --lam 100 \
--slo-ttft 500 --slo-tpot 100 \
--gpu-prefill a100 --gpu-decode h100 --max-ctx 8192
Azure λ=100 result (mean ISL≈1450 tok, mean OSL≈483 tok, TTFT includes 1.8× KV-transfer overhead):
| Config | GPUs | Cost/yr | TTFT | TPOT |
|---|---|---|---|---|
| All-A100 aggregated | 12 | $232 K | 26 ms | — |
| All-H100 aggregated | 6 | $211 K | 8 ms | — |
| H100P + A100D | 7 (1P+6D) | $151 K | 162 ms | 91 ms |
| H100P + H100D | 4 (1P+3D) | $141 K | 162 ms | 45 ms |
| A100P + H100D | 4 (1P+3D) | $125 K ← cheapest | 492 ms | 45 ms |
Insights:
- Disagg saves 35–46% vs aggregated, at the cost of higher TTFT (KV-transfer overhead adds 1.8× to raw prefill time).
- A100P + H100D beats H100P + A100D even though an H100 costs 1.82× as much as an A100. The reason: H100 decode workers each handle 2.5× more requests/second than A100 (lower W and faster per-token iteration), so you need far fewer of them (3 vs 6). One cheap A100 handles all prefill at λ=100 req/s. Counter-intuitively, the decode pool is where the premium GPU earns back its cost.
- When disagg is worth it: high throughput targets where cost efficiency matters more than latency. Disagg cuts fleet cost by ~46% vs all-A100 aggregated, but TTFT climbs to ~500 ms (KV transfer + A100 prefill). If your TTFT SLO is ≤ 200 ms, use H100P (TTFT drops to 162 ms) at $141 K, still 33% cheaper than all-H100.
- When to stay aggregated: low-traffic or latency-critical fleets (< 50 req/s, TTFT SLO ≤ 100 ms). Disagg adds operational complexity (two separate scaling policies, KV-transfer networking) that is not justified below ~100 req/s.
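The cost figures and savings quoted above follow from the prefill/decode GPU counts and the hourly rates. The sketch assumes 8,760 billed hours per year; the helper name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous billing
HOURLY_USD = {"a100": 2.21, "h100": 4.02}


def fleet_cost_k(pools) -> float:
    """Annual cost in $K for a list of (gpu_type, count) pools."""
    return sum(HOURLY_USD[gpu] * n for gpu, n in pools) * HOURS_PER_YEAR / 1000.0


all_a100 = fleet_cost_k([("a100", 12)])
all_h100 = fleet_cost_k([("h100", 6)])
configs = {
    "H100P + A100D": [("h100", 1), ("a100", 6)],
    "H100P + H100D": [("h100", 1), ("h100", 3)],
    "A100P + H100D": [("a100", 1), ("h100", 3)],
}

for name, pools in configs.items():
    cost = fleet_cost_k(pools)
    print(f"{name}: ${cost:.0f}K, "
          f"{1 - cost / all_a100:.0%} below all-A100, "
          f"{1 - cost / all_h100:.0%} below all-H100")
```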
Summary
Cost vs homo at a glance
| Workload | Config | Cost delta | SLO |
|---|---|---|---|
| Azure chat, λ=50 | Two-pool H100 | +25% | ✓ |
| Azure chat, λ=200 | Two-pool H100 | tied | ✓ |
| LMSYS chat, λ=100 | Two-pool H100 | −38% | ✓ |
| Agent-heavy, λ=20 | Homo H100 | — | ✗ FAIL |
| Agent-heavy, λ=20 | Two-pool H100 | +4% | ✓ |