Design a system to deploy a 500 GB machine-learning model to 100–1000 GPU workers spread across multiple data centers. The model is stored in cloud object storage, and each site has 10 Gbps of external bandwidth to reach it. Each worker has 100 Gbps of bandwidth to peers inside the same DC and 10 Gbps to the external internet. Workers are stateful: they are actively serving live inference traffic and cannot be taken offline during the update, and some have only enough GPU memory to hold a single copy of the model.

Requirements:
- p99 model download time under 30 s, end-to-end rollout under 5 minutes
- support for canary and staged rollouts
- automatic rollback if the new model produces bad outputs
- zero dropped requests

Outline the high-level architecture, the distribution protocol, the rollout strategy, and the rollback mechanism.
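Before sketching an architecture, it is worth checking the stated bandwidths against the timing budget. The sketch below is a minimal back-of-envelope calculation using only numbers from the prompt (it ignores protocol overhead and congestion; the function and variable names are illustrative, not part of the problem statement):

```python
# Back-of-envelope transfer times from the numbers in the prompt.
MODEL_GB = 500
MODEL_GBIT = MODEL_GB * 8  # 4000 gigabits

def transfer_seconds(size_gbit: float, bandwidth_gbps: float) -> float:
    """Ideal serial transfer time, ignoring protocol overhead."""
    return size_gbit / bandwidth_gbps

# Pulling the model straight from object storage over a site's 10 Gbps link:
cold_pull = transfer_seconds(MODEL_GBIT, 10)    # 400 s: far over budget
# A full-model copy between two peers in the same DC at 100 Gbps:
peer_copy = transfer_seconds(MODEL_GBIT, 100)   # 40 s per whole-model hop

print(f"object storage pull: {cold_pull:.0f} s")
print(f"single peer copy:    {peer_copy:.0f} s")
```

The 400 s cold pull suggests each DC can afford only a small number of fetches from object storage per rollout, and even a whole-model peer copy at full line rate takes 40 s, which already exceeds the 30 s p99 budget. Together these numbers push the design toward chunked, pipelined, peer-to-peer distribution inside each DC rather than serial whole-model copies.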