Cold start latency on GPU cloud platforms in 2026 — p99 specifically, not p50. Anyone have real data? [D]
Doing infrastructure evaluation for inference workloads and running into the same problem everywhere: every platform publishes p50 or median cold start numbers, but nobody publishes p99. And p99 is the number that shows up in support tickets and SLA violations, not p50.
What I'm specifically trying to understand:
How does cold start p99 behave under load vs. normal conditions? Is there meaningful degradation when providers are at high utilization?
Does multi-provider pooling actually improve p99, or just p50? The logic seems sound (route to wherever capacity exists), but I haven't found published data; my back-of-envelope reasoning is sketched right after this list.
How much of cold start is infrastructure queue time vs. model loading time? I suspect the two are often conflated in marketing claims.
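On the pooling question, the order-statistics math says it should help the tail disproportionately: with k independent providers and oracle routing, P(min > t) = P(X > t)^k, so a 1-in-100 tail event becomes roughly 1-in-10^6 at k = 3. Whether real providers are anywhere near independent is the open question. A toy Monte Carlo of that argument (every distribution parameter below is invented for illustration, not measured):

```python
# Toy Monte Carlo: does best-of-k routing help p99 more than p50?
# Every distribution parameter here is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # lots of trials, since a p99 estimate needs a fat sample

def cold_start_samples(n):
    # Hypothetical single provider: ~20s of model load plus a heavy-tailed
    # queue delay (the lognormal tail is a guess, not data).
    load = rng.normal(20.0, 3.0, n).clip(min=5.0)
    queue = rng.lognormal(mean=1.0, sigma=1.5, size=n)
    return load + queue

single = cold_start_samples(N)

# Pooling: take the min over k independent providers. This models oracle
# routing (always pick the fastest) with independent tails -- both are
# optimistic, so treat the result as an upper bound on the benefit.
k = 3
pooled = np.min([cold_start_samples(N) for _ in range(k)], axis=0)

for name, xs in [("single provider", single), (f"best of {k}", pooled)]:
    p50, p99 = np.percentile(xs, [50, 99])
    print(f"{name:16s} p50={p50:5.1f}s  p99={p99:6.1f}s")
```

Even if my made-up numbers are wrong, the shape of the result is forced by the math: the min of k draws compresses the tail much harder than the median, so pooling should move p99 more than p50. Correlated demand spikes (everyone wanting H200s at once) would erode exactly the independence assumption, which is why I want real data.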
For context: I'm running inference workloads on 70B-class models, primarily on RTX 5090s and H200s, and I care deeply about p99 because the latency is user-facing.
Anyone have real numbers, or a methodology for measuring this properly?
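For what it's worth, here's the harness shape I've been sketching: force a genuine cold start, timestamp three points (request submit, instance ready, first token), and split queue time from load time. Everything provider-specific here is a placeholder: the base URL, the /scale and /status endpoints, and the container_started_at field are stand-ins, since every platform exposes (or hides) these differently.

```python
# Cold-start measurement sketch. The base URL, the /scale and /status
# endpoints, and the container_started_at field are PLACEHOLDERS --
# adapt all of them per provider.
import time
import statistics
import requests

ENDPOINT = "https://example.invalid/v1/endpoints/my-endpoint"  # placeholder

def measure_one_cold_start():
    # Force scale-to-zero first, otherwise you're timing a warm instance.
    requests.post(f"{ENDPOINT}/scale", json={"replicas": 0}, timeout=30)
    time.sleep(120)  # let any keep-warm window expire; tune per provider

    t_submit = time.time()
    # Fire the request that triggers scale-up; stream so we catch the
    # first token instead of waiting for the full completion.
    with requests.post(f"{ENDPOINT}/generate",
                       json={"prompt": "ping", "max_tokens": 1},
                       stream=True, timeout=600) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1))  # first byte ~= first token
    t_first_token = time.time()

    # The instance-ready timestamp is the hard part. Here I pretend the
    # provider's status API reports when the container started (epoch
    # seconds); many don't, and you have to scrape the first log line
    # instead. Clock skew between you and the provider adds noise.
    status = requests.get(f"{ENDPOINT}/status", timeout=30).json()
    t_ready = status["container_started_at"]  # hypothetical field

    return {
        "queue_s": t_ready - t_submit,       # scheduling + provisioning
        "load_s": t_first_token - t_ready,   # image pull, weights load, warmup
        "total_s": t_first_token - t_submit,
    }

trials = [measure_one_cold_start() for _ in range(300)]
totals = sorted(t["total_s"] for t in trials)
print("p50:", statistics.median(totals))
print("p99:", statistics.quantiles(totals, n=100)[98])  # 99th cut point
```

The catch is sample size: the p99 of 300 trials rests on the top three observations, and 300 forced cold starts per provider per load condition gets expensive fast, which I suspect is exactly why nobody publishes this number. Would love to hear if anyone has a cheaper way to get at the tail.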