Kubernetes Debugging and Architecture


[ INFO ] category: System Design · Domain Specific | difficulty: medium | freq: high | first seen: 2026-01-13

[MEDIUM][DOMAIN SPECIFIC][HIGH] Kubernetes, DevOps, Debugging, Infrastructure

$ cat problem.md

You are the on-call engineer for a multi-tenant GPU cluster at NVIDIA. A critical customer workload has been failing for 30 minutes: every new Pod scheduled to the “gpu-high-mem” node pool crashes on startup and goes into CrashLoopBackOff, while existing Pods on those nodes continue to serve traffic normally. Your task is to get new Pods on that pool starting successfully again within 30 minutes, without impacting running workloads.

Walk the interviewer through your live-debugging and remediation plan. You may ask for any kubectl, node, or cloud-provider output you need; the interviewer will provide it. You must:

Identify the root cause layer (DNS, image, runtime, kubelet, device-plugin, network, admission, resource exhaustion, taint, or GPU driver).
Explain every kubectl command you would run and what you expect to see.
Describe how you would validate the fix and confirm new Pods reach Running state.
Outline a post-mortem architecture change (e.g., admission webhook, monitoring alert, or daemon-set hardening) to prevent recurrence.
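
As one possible starting point for the "explain every kubectl command" requirement, the sketch below summarizes which containers are in CrashLoopBackOff, on which node, and how they last exited. It assumes input shaped like the output of `kubectl get pods -o json` and reads only standard Pod fields (`spec.nodeName`, `status.containerStatuses[].state.waiting.reason`, `lastState.terminated`). The function name, the sample Pod names, and the node name are hypothetical and exist only for illustration; this is a triage aid, not a root-cause detector.

```python
from typing import Any, Dict, List


def summarize_crashing_containers(pod_list: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per container currently waiting in CrashLoopBackOff.

    `pod_list` is assumed to have the shape of `kubectl get pods -o json`.
    """
    rows = []
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") != "CrashLoopBackOff":
                continue
            # lastState.terminated describes the most recent crash, if any.
            last = cs.get("lastState", {}).get("terminated") or {}
            rows.append({
                "pod": pod["metadata"]["name"],
                "node": node,
                "container": cs["name"],
                "restarts": cs.get("restartCount", 0),
                "exit_code": last.get("exitCode"),
                "last_reason": last.get("reason"),
            })
    return rows


# Hypothetical sample data: one crashing new Pod and one healthy existing Pod.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "trainer-new"},
            "spec": {"nodeName": "gpu-high-mem-node-1"},
            "status": {"containerStatuses": [{
                "name": "main",
                "restartCount": 5,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}},
            }]},
        },
        {
            "metadata": {"name": "trainer-old"},
            "spec": {"nodeName": "gpu-high-mem-node-1"},
            "status": {"containerStatuses": [{
                "name": "main",
                "restartCount": 0,
                "state": {"running": {"startedAt": "2026-01-13T00:00:00Z"}},
            }]},
        },
    ]
}

for row in summarize_crashing_containers(SAMPLE):
    print(row)
```

Against a real cluster, the same function could be fed `json.load(sys.stdin)` with `kubectl get pods -o json` piped in. Grouping the rows by node and exit code is a quick way to see whether the failure is confined to the node pool or follows a particular image or container, which narrows the candidate layers before running `kubectl describe pod` and `kubectl logs --previous` on a single failing Pod.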

user@intervues:~/nvidia$