Your team operates a production machine learning inference service that processes customer requests in real time. Over the past week, users have reported increased latency, with P95 response times jumping from 200ms to 800ms. The service performs model inference for a computer vision application that classifies images uploaded by users, handling approximately 10,000 requests per minute during peak hours across a cluster of GPU-enabled inference servers.
As the infrastructure engineer on call, you need to systematically diagnose the performance degradation and propose solutions to restore acceptable latency levels. The inference pipeline consists of multiple stages: request ingestion, preprocessing, model inference, postprocessing, and response delivery. No recent deployments have been made to the model or application code, though traffic has grown by approximately 15% over the last month.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see a structured approach to identifying bottlenecks rather than random guessing. They're evaluating whether you understand observability principles and can methodically narrow down root causes in a complex distributed system.
For ML infrastructure roles, understanding GPU utilization patterns and bottlenecks is critical. Interviewers look for knowledge of how inference workloads interact with hardware accelerators.
Efficient request batching is fundamental to ML inference performance. Interviewers expect candidates to understand the tradeoffs between latency and throughput.
While this is an infra role, understanding model-level optimizations shows depth. Interviewers may probe whether you know when to involve ML engineers versus when infra changes suffice.
Interviewers want to see how you think about horizontal vs. vertical scaling and resource allocation strategies for ML workloads.
Start by understanding the system's current state and recent changes. Ask about the architecture (monolithic service vs. microservices), the specific model being served (size, complexity, framework), and the infrastructure stack (cloud provider, orchestration, GPU types). Confirm the baseline performance metrics before the degradation occurred. Inquire about any recent changes to traffic patterns, data distribution, or infrastructure updates. Establish which metrics are already being collected and what observability tools are available.
Propose adding detailed instrumentation to break down latency by pipeline stage: request queuing time, preprocessing duration, GPU inference time, postprocessing duration, and network transmission time. Set up distributed tracing if not already present to track individual requests through the system. Create dashboards showing P50, P95, and P99 latencies for each stage over time. Monitor GPU-specific metrics including utilization percentage, memory usage, SM (streaming multiprocessor) occupancy, and kernel execution time. Track system-level metrics like CPU usage, memory pressure, network bandwidth, and disk I/O.
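The per-stage breakdown above can be sketched in a few lines. This is a minimal illustration, not the service's actual instrumentation: the `StageTimer` class, its nearest-rank percentile method, and the stage names are all hypothetical. In practice you would export these timings to your existing metrics backend (e.g. Prometheus histograms) rather than aggregate in process.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Records wall-clock duration per pipeline stage; reports percentiles.

    Illustrative only -- a real service would emit these samples to a
    metrics backend instead of holding them in memory.
    """

    def __init__(self):
        self.samples = defaultdict(list)  # stage name -> list of ms durations

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def percentile(self, name, p):
        """Nearest-rank percentile (p in 0..100) over recorded samples."""
        data = sorted(self.samples[name])
        if not data:
            return None
        idx = max(0, math.ceil(p * len(data) / 100) - 1)
        return data[idx]


# Usage: wrap each pipeline stage, then read P50/P95/P99 per stage.
timer = StageTimer()
with timer.stage("preprocess"):
    pass  # stand-in for real preprocessing work
```

Comparing the P95 of each stage over time is what lets you say "inference time is flat, but queueing time tripled" instead of guessing.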
Use the instrumentation to pinpoint which stage(s) show degraded performance. For GPU bottlenecks, check if utilization is low (indicating batching issues or overhead) or high (indicating saturation). Examine whether GPU memory is being exhausted, causing swapping or OOM errors. For CPU bottlenecks, profile preprocessing code to identify hot paths. Check for resource contention—are multiple processes competing for the same GPUs? Analyze request queue depths to determine if the system is under-provisioned. Compare current traffic characteristics (image sizes, batch distributions) against baselines to identify anomalies.
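The comparison against baselines can be mechanized with a simple heuristic. The sketch below assumes you already have per-stage P95 figures from before and after the degradation; the function name, threshold, and the example numbers are invented for illustration.

```python
def find_degraded_stages(baseline_p95_ms, current_p95_ms, threshold=1.5):
    """Flag stages whose current P95 exceeds baseline by the given ratio.

    Both arguments map stage name -> P95 latency in milliseconds.
    Stages without a recorded baseline are skipped rather than guessed at.
    Returns {stage: degradation_ratio} for flagged stages.
    """
    degraded = {}
    for stage, current in current_p95_ms.items():
        baseline = baseline_p95_ms.get(stage)
        if baseline and current / baseline >= threshold:
            degraded[stage] = current / baseline
    return degraded


# Hypothetical numbers: inference itself is stable, queueing blew up,
# which points at batching or provisioning rather than the model.
baseline = {"queue": 5.0, "preprocess": 30.0, "inference": 120.0, "postprocess": 20.0}
current = {"queue": 400.0, "preprocess": 35.0, "inference": 130.0, "postprocess": 22.0}
```

A result like `{"queue": 80.0}` immediately narrows the investigation to request queueing, steering you toward GPU saturation, batching configuration, or under-provisioning rather than model-level profiling.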
Based on bottleneck analysis, propose targeted solutions. If GPU utilization is low, implement or tune dynamic batching with appropriate timeout thresholds. If GPUs are saturated, consider horizontal scaling or upgrading to more powerful GPU instances. For memory issues, explore model quantization or reducing batch sizes. If preprocessing is slow, parallelize operations or move them to separate CPU workers. Implement multi-model serving or model replicas if a single model instance is the bottleneck. For queueing issues, add load shedding or priority queues. Discuss how you would A/B test changes in production, monitor for regressions, and gradually roll out optimizations while maintaining reliability.
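The dynamic-batching tuning mentioned above hinges on one tradeoff: flush a batch when it fills up or when a timeout expires, whichever comes first. A longer timeout improves GPU efficiency but adds tail latency. The loop below is a simplified single-threaded sketch (production servers such as Triton implement this natively); `batching_loop`, its parameters, and the stop-event plumbing are all illustrative.

```python
import queue
import threading
import time


def batching_loop(request_q, handle_batch, max_batch=32, max_wait_ms=5, stop=None):
    """Group requests into batches for GPU inference.

    Flushes when max_batch requests are collected, or when max_wait_ms has
    elapsed since the first request of the batch arrived -- whichever is
    first. Raising max_wait_ms trades P95/P99 latency for throughput.
    """
    while not (stop and stop.is_set()):
        try:
            first = request_q.get(timeout=0.01)
        except queue.Empty:
            continue  # idle; re-check the stop flag
        batch = [first]
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # timeout: ship a partial batch rather than wait
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        handle_batch(batch)  # stand-in for the GPU inference call
```

When GPU utilization is low, the fix is usually here: batches are flushing too small (timeout too short for the arrival rate), so each kernel launch amortizes over too few requests.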