Network and Linux Fundamentals (SRE)


[ INFO ] category: System Design · Domain Specific · difficulty: medium · freq: high · first seen: 2026-01-13

[MEDIUM] [DOMAIN SPECIFIC] [HIGH] Infrastructure · Networking · Linux · SRE · Domain

$ cat problem.md

You are the on-call Site Reliability Engineer for TikTok’s CDN edge PoP in Frankfurt. At 03:17 UTC you receive a PagerDuty alert: “Error rate on cache-fra-42 spiked from 0.3% to 18%; 5xx responses from origin are climbing; 95th-percentile latency has doubled.” SSH into the box succeeds, but every interactive command takes more than 5 s to return. Your runbook calls for a 5-minute live diagnosis followed by a 3-minute executive summary on the incident bridge. Walk the interviewer through the exact commands you type, in order, and the signals you watch to decide whether the fault is (a) DNS resolution, (b) exhausted ephemeral ports, (c) disk saturation on the NVMe cache volume, (d) memory pressure triggering OOM kills of the nginx workers, or (e) packet loss on the upstream backbone. After you identify the root cause, give the single sysctl or systemd command you would run to mitigate within the next 60 seconds without restarting any service. Justify every step with the output you see and state the risk of your mitigation.
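
One possible shape for the command sequence is sketched below. It is not a reference answer: the origin hostname (`origin.example.internal`), the helper name `port_util_pct`, the ordering, and the timeouts are all illustrative assumptions, and the tools shown (`iostat`, `dig`, `mtr`) are assumed to be installed on the box. Each check is wrapped in `timeout` because the scenario says the shell is already slow, and each line is tagged with the hypothesis it helps confirm or rule out.

```shell
#!/usr/bin/env bash
# Hypothetical triage sketch. Hostname, ordering, and timeouts are assumptions,
# not part of the original problem statement.

ORIGIN_HOST="${ORIGIN_HOST:-origin.example.internal}"  # assumed origin name

# Percentage of the ephemeral port range currently in use (integer division).
# Args: <sockets in use> <range low> <range high>
port_util_pct() {
  local used=$1 lo=$2 hi=$3
  echo $(( used * 100 / (hi - lo + 1) ))
}

# Ordered checks, cheapest first. Defined as a function so nothing runs
# until the operator calls `triage` explicitly.
triage() {
  timeout 5 uptime                                        # load average vs. core count
  timeout 5 cat /proc/pressure/io /proc/pressure/memory   # PSI stalls (kernel >= 4.20): (c), (d)
  timeout 5 dmesg -T | grep -iE 'out of memory|oom-kill' | tail -n 5   # (d) OOM kills
  timeout 5 iostat -x 1 3                                 # (c) %util and await on the NVMe device
  timeout 5 ss -s                                         # (b) socket totals, timewait count
  timeout 5 cat /proc/sys/net/ipv4/ip_local_port_range    # (b) size of the ephemeral range
  timeout 5 dig +tries=1 +time=2 "$ORIGIN_HOST"           # (a) resolver latency or failure
  timeout 20 mtr -rwc 10 "$ORIGIN_HOST"                   # (e) per-hop loss toward origin
}
```

As a usage example, `port_util_pct 28000 32768 60999` prints `99`, which would point toward hypothesis (b) on a box using the default Linux range. In that case one candidate mitigation is widening the range with `sysctl -w net.ipv4.ip_local_port_range="1024 65535"`; the risk to state is that outbound connections may then take ports that local services expect to bind later.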

user@intervues:~/tiktok$