Design a Solution for the Noisy Neighbor Problem
System Design · Must
Problem Statement
Build a cloud infrastructure platform that enables hundreds of customers to safely share the same physical hardware without interfering with each other's performance. Your system must prevent "noisy neighbor" scenarios where one tenant's burst of CPU usage, disk I/O, or network traffic degrades the experience of co-located tenants. The platform runs diverse workloads from batch jobs to latency-sensitive services, all competing for CPU cores, memory, disk bandwidth, and network capacity on shared hosts.
The challenge is to maintain fair resource allocation and enforce service-level agreements while maximizing infrastructure utilization. Your design should detect violations in real time, automatically respond with throttling or migration, and provide operators with visibility into resource contention patterns. Consider how AWS EC2, Google Compute Engine, or Databricks serverless clusters isolate thousands of tenants on commodity hardware while guaranteeing predictable performance.
Key Requirements
Functional
- Tenant isolation -- guarantee that one tenant's resource consumption cannot starve or significantly degrade another tenant's workload
- Real-time monitoring -- continuously measure per-tenant CPU, memory, disk I/O, and network usage with sub-second granularity
- Automated enforcement -- detect policy violations and automatically throttle, reschedule, or migrate offending workloads
- Quota management -- allow administrators to set and update per-tenant limits and priority levels across all resource types
- Transparency -- provide tenants and operators with dashboards showing current usage, throttling events, and historical trends
Non-Functional
- Scalability -- support 10,000+ hosts with 50+ tenants per host, generating 1M+ metrics per second
- Reliability -- continue enforcing isolation on hosts even when the central control plane is unreachable; no single point of failure
- Latency -- detect violations within 2 seconds and begin mitigation within 5 seconds to prevent SLA breaches
- Consistency -- eventual consistency acceptable for quota updates; strong consistency required for admission control decisions
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Resource Isolation Mechanisms
Interviewers want to see that you understand the low-level operating system and hypervisor features that actually enforce isolation, not just high-level orchestration concepts.
Hints to consider:
- Discuss Linux cgroups (cpu, memory, and the io controller, blkio in cgroup v1) for container-level limits and how they differ from VM-level hypervisor controls
- Explain Linux I/O schedulers (deadline and CFQ on older kernels; mq-deadline and BFQ on modern multiqueue kernels) and how you'd prioritize disk operations when multiple tenants hit the same spindle
- Consider network isolation via tc (traffic control), eBPF programs, or NIC-level rate limiting with SR-IOV
- Address the tradeoff between hard limits (guaranteed protection but potential waste) versus soft limits (better utilization but weaker guarantees)
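To make the cgroup discussion concrete, here is a minimal sketch of how per-tenant hard limits map onto cgroup v2 interface files. It assumes a cgroup v2 mount at `/sys/fs/cgroup` and a Python host agent; the helper names (`cpu_max`, `io_max`, `apply_limits`) are illustrative, not a real API.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumed cgroup v2 mount point

def cpu_max(cpu_fraction: float, period_us: int = 100_000) -> str:
    """cpu.max takes '<quota_us> <period_us>'; e.g. 50% of one core -> '50000 100000'."""
    return f"{int(cpu_fraction * period_us)} {period_us}"

def io_max(device: str, rbps: int, wbps: int) -> str:
    """io.max line for one block device (major:minor), capping read/write bytes/sec."""
    return f"{device} rbps={rbps} wbps={wbps}"

def apply_limits(tenant: str, cpu_fraction: float, mem_bytes: int) -> None:
    """Write hard limits into the tenant's cgroup (requires root; illustrative only)."""
    cg = CGROUP_ROOT / f"tenant-{tenant}"
    cg.mkdir(exist_ok=True)
    (cg / "cpu.max").write_text(cpu_max(cpu_fraction))
    (cg / "memory.max").write_text(str(mem_bytes))
```

Writing `cpu.max` and `memory.max` gives the hard-limit end of the tradeoff above; soft limits would use `cpu.weight` and `memory.high` instead, trading guaranteed protection for better utilization.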
2. Detection and Feedback Loops
The system must identify noisy neighbors quickly enough to prevent SLA violations, which requires careful metric collection and anomaly detection architecture.
Hints to consider:
- Design a hierarchical monitoring system with host-level agents shipping metrics to regional aggregators to avoid overwhelming a central service
- Explain how you'd use sliding windows and percentile calculations to detect abnormal resource spikes versus expected variability
- Discuss correlation across multiple resource dimensions since a noisy neighbor might exhaust network bandwidth while CPU looks normal
- Consider the feedback loop latency budget -- from metric collection through detection, decision-making, and enforcement back on the host
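A sliding-window percentile check is one simple way a host agent could separate abnormal spikes from expected variability. The sketch below (class name and thresholds are illustrative assumptions) flags a sample only when it far exceeds the p95 of the tenant's recent history:

```python
from collections import deque

class SpikeDetector:
    """Flag a tenant when the latest sample far exceeds a high percentile
    of its own recent history (sliding window)."""

    def __init__(self, window: int = 60, factor: float = 2.0):
        self.samples = deque(maxlen=window)  # bounded history per tenant
        self.factor = factor                 # how far above p95 counts as a spike

    def observe(self, value: float) -> bool:
        """Record one sample; return True if it is a spike vs. the window's p95."""
        spike = False
        if len(self.samples) >= 10:  # need some history before judging
            ordered = sorted(self.samples)
            p95 = ordered[int(0.95 * (len(ordered) - 1))]
            spike = value > self.factor * max(p95, 1e-9)
        self.samples.append(value)
        return spike
```

In practice you would run one detector per (tenant, resource dimension) pair and correlate their outputs, since, as noted above, a noisy neighbor can look normal on CPU while saturating the network.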
3. Multi-Step Mitigation Strategies
Remediation isn't instantaneous; it involves coordinated actions across multiple systems while maintaining overall platform stability.
Hints to consider:
- Describe a tiered response: first warn and throttle slightly, escalate to hard throttling, then consider live migration as a last resort
- Explain how you'd prevent thrashing where the system constantly moves workloads between hosts without resolving the underlying contention
- Discuss coordination mechanisms to ensure two controllers don't simultaneously try to migrate the same VM or make conflicting throttling decisions
- Address rollback and safety: how do you undo a mistaken throttling action if your detection had a false positive?
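The tiered response and anti-thrashing points can be sketched as a small state machine with hysteresis: escalate one tier per confirmed violation, but de-escalate only after a quiet cool-down, which also gives a safe one-step rollback path for false positives. Tier names and the cool-down value are illustrative assumptions.

```python
import time

TIERS = ["OK", "WARN", "SOFT_THROTTLE", "HARD_THROTTLE", "MIGRATE"]

class MitigationPolicy:
    def __init__(self, cooldown_s: float = 30.0, clock=time.monotonic):
        self.level = 0
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable for testing
        self.last_change = clock()

    def update(self, violating: bool) -> str:
        """Escalate one tier per violation; step down only after a quiet cool-down."""
        now = self.clock()
        if violating:
            if self.level < len(TIERS) - 1:
                self.level += 1
                self.last_change = now
        elif self.level > 0 and now - self.last_change >= self.cooldown_s:
            self.level -= 1  # gradual rollback: safe undo if detection was a false positive
            self.last_change = now
        return TIERS[self.level]
```

The cool-down is what prevents thrashing: a workload cannot bounce between throttled and unthrottled faster than the hysteresis window, and migration is only reachable after the cheaper tiers have failed.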
4. Quota and Admission Control
Preventing resource contention begins at workload admission time, before tenants even start consuming resources.
Hints to consider:
- Design a distributed admission control system that checks available capacity across multiple resource dimensions before placing a workload
- Explain how you'd handle burst credits or overcommitment ratios that allow tenants to temporarily exceed quotas during low-contention periods
- Discuss priority classes where premium tenants get stricter guarantees while best-effort tenants can be preempted during contention
- Consider the consistency challenge: admission decisions must be based on recent capacity data, but you can't afford synchronous locks across thousands of hosts
5. Observability and Closed-Loop Control
Operators need visibility into system behavior, and the platform must continuously tune itself based on observed patterns.
Hints to consider:
- Design dashboards that show per-tenant resource consumption trends, throttling frequency, and proximity to quota limits
- Explain how you'd capture detailed traces of mitigation actions for post-incident analysis when a tenant reports degraded performance
- Discuss feedback mechanisms where the system learns optimal thresholds by observing which throttling actions successfully resolved contention
- Consider the cold-start problem: how do you set initial policies for new tenant workloads before you have historical usage data?
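One way to frame the cold-start and self-tuning points together: start a new tenant on a conservative default limit, then let the limit track an exponentially weighted moving average of observed usage plus headroom, capped by the hard quota. Class name, headroom factor, and smoothing constant below are illustrative assumptions.

```python
class AdaptiveThreshold:
    """Cold-start sketch: begin with a conservative default limit, then adapt
    toward EWMA(usage) * headroom, never exceeding the tenant's hard quota."""

    def __init__(self, default_limit: float, hard_quota: float,
                 headroom: float = 1.3, alpha: float = 0.1):
        self.limit = default_limit   # applied before any history exists
        self.hard_quota = hard_quota
        self.headroom = headroom     # slack above typical usage before throttling
        self.alpha = alpha           # EWMA smoothing factor
        self.ewma = None

    def observe(self, usage: float) -> float:
        """Fold in one usage sample and return the updated soft limit."""
        self.ewma = usage if self.ewma is None else (
            self.alpha * usage + (1 - self.alpha) * self.ewma)
        self.limit = min(self.hard_quota, self.ewma * self.headroom)
        return self.limit
```

The hard-quota cap keeps a runaway tenant from "teaching" the system to tolerate it, while the EWMA means a single anomalous sample moves the threshold only slightly, which is a crude form of the learned-threshold feedback described above.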