Debug/Incident Response


[ INFO ] category: System Design · Domain Specific | difficulty: medium | freq: medium | first seen: 2026-01-13

[MEDIUM] [DOMAIN SPECIFIC] [MEDIUM] Debugging · Incident Response · System Operations

$ cat problem.md

You are the on-call Site Reliability Engineer for Discord’s voice infrastructure. At 02:17 UTC you receive a P0 page: "Users cannot join voice channels in US-East; connection success rate dropped from 99.8% to 12% in the last 10 minutes." Your task is to restore service and then explain how you would prevent recurrence. Walk the interviewer through exactly what you do in the first 30 minutes: how you classify severity, which dashboards you open, which logs, metrics, and traces you query, how you decide whether to roll back the most recent deployment or fail over to a secondary region, how you communicate status internally and on Twitter, and how you verify the fix. After mitigation, outline the post-incident review you will run: the timeline template you’ll fill in, the root-cause analysis tools you’ll use, and the systemic improvements you’ll propose.
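One way to frame the rollback-versus-failover decision the prompt asks about is as an explicit rule over a few signals. The sketch below is illustrative only: every field name, the 60-minute deploy window, and the ordering of the checks are assumptions made for this example, not Discord's actual tooling or policy. It also includes a small helper that turns the paged numbers into a failure-rate multiplier, which can help justify the severity classification.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical inputs an on-call engineer might gather (not real telemetry)."""
    minutes_since_last_deploy: float  # age of the latest deploy in the affected region
    deploy_touched_voice_path: bool   # did that deploy change voice-join code or config?
    secondary_region_healthy: bool    # is the failover target passing health checks?
    secondary_has_capacity: bool      # can it absorb the affected region's traffic?


def choose_mitigation(s: IncidentSignals, deploy_window_min: float = 60.0) -> str:
    """Return 'rollback', 'failover', or 'investigate' using a simple assumed rule."""
    recent_suspect_deploy = (
        s.deploy_touched_voice_path
        and s.minutes_since_last_deploy <= deploy_window_min
    )
    if recent_suspect_deploy:
        # A recent change on the affected path is the cheapest thing to undo.
        return "rollback"
    if s.secondary_region_healthy and s.secondary_has_capacity:
        # No suspect deploy: move traffic away while the cause is investigated.
        return "failover"
    # Neither lever is safe to pull yet; keep gathering evidence.
    return "investigate"


def failure_rate_multiplier(before_success_pct: float, after_success_pct: float) -> float:
    """How many times larger the failure rate became, given success percentages."""
    return (100.0 - after_success_pct) / (100.0 - before_success_pct)


if __name__ == "__main__":
    signals = IncidentSignals(
        minutes_since_last_deploy=25.0,
        deploy_touched_voice_path=True,
        secondary_region_healthy=True,
        secondary_has_capacity=True,
    )
    print(choose_mitigation(signals))
    print(failure_rate_multiplier(99.8, 12.0))
```

With the figures from the page, the failure rate rises from 0.2% to 88%, roughly a 440-fold increase. A real answer would also weigh factors this sketch ignores, such as rollback duration, data-plane state, and whether the secondary region shares the suspected fault.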

user@intervues:~/discord$