← Back to companies
[ OK ] Loaded —
[ INFO ]
$ cd
$ ls -lt
01
02
03
04
05
$ ls -lt
01
02
03
04
05
user@intervues:~/$
You are the on-call engineer for a multi-tenant GPU cluster at NVIDIA. A critical customer workload has been failing for 30 minutes: every new Pod scheduled to the “gpu-high-mem” node pool enters CrashLoopBackOff immediately, while existing Pods on those nodes continue to serve traffic normally. Your task is to restore schedulability within 30 minutes without impacting running workloads.
Walk the interviewer through your live-debugging and remediation plan. You may ask for any kubectl, node, or cloud-provider output you need; the interviewer will provide it. You must: