You are starting with a brand-new account on a public cloud provider (AWS, GCP, or Azure). Your task is to take the account from empty to a production-ready Kubernetes cluster that can run GPU-accelerated ML training and inference workloads for internal NVIDIA teams. Walk through the end-to-end design and deployment steps you would follow.

Cover:
(1) cloud account bootstrapping and network foundation,
(2) highly-available control-plane setup,
(3) GPU-enabled worker node onboarding,
(4) cluster add-ons for observability, security, and ingress, and
(5) day-2 operations such as upgrades, etcd backup, cost control, and incident response.

Assume the cluster must support 500 GPUs across three availability zones, with <5 min MTTR for control-plane failures and <30 min MTTR for worker-node failures.

You have 45 minutes: spend the first 5 clarifying requirements, the next 30 presenting your design, and the final 10 discussing trade-offs and failure scenarios.
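As a quick sanity check on the stated scale, the 500-GPU / three-AZ requirement implies a concrete node count. The sketch below is a back-of-envelope calculation only; the 8-GPU node shape and the one-spare-node-per-AZ headroom are assumptions not given in the prompt.

```python
import math

# Requirements stated in the prompt.
GPU_TARGET = 500
AZS = 3

# Assumptions (not in the prompt): 8-GPU worker nodes, one spare node per AZ.
GPUS_PER_NODE = 8
SPARES_PER_AZ = 1

# Minimum workers to reach the GPU target, rounded up.
nodes_needed = math.ceil(GPU_TARGET / GPUS_PER_NODE)    # 63 nodes

# Spread evenly across AZs, rounding up so no AZ is short.
nodes_per_az = math.ceil(nodes_needed / AZS)            # 21 nodes per AZ

# Total fleet size including per-AZ spares for <30 min worker MTTR.
total_nodes = (nodes_per_az + SPARES_PER_AZ) * AZS      # 66 nodes

print(nodes_needed, nodes_per_az, total_nodes)
```

A candidate who surfaces this arithmetic early can then size subnets, IP ranges, and quota requests (step 1) against a concrete fleet of roughly 66 nodes rather than an abstract "500 GPUs".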