System Design - Design a Network Throttling System
Category: System Design — first seen: 2026-03-13
In a typical Databricks Software Engineer interview, the Network Throttling System is a core system design problem used to evaluate a candidate's ability to handle high-throughput, low-latency, and multi-tenant constraints.
Problem Statement: Network Throttling System
The Scenario: Databricks operates a massive multi-tenant platform where thousands of clients execute data processing jobs and API calls simultaneously. To ensure system reliability and prevent a single "noisy neighbor" from saturating the shared network bandwidth, you must design a Network Throttling System that regulates data flow per client.
Core Requirements:
Per-Client Quotas: Each client (e.g., a specific workspace or user ID) has a configured maximum throughput (bandwidth).
Throughput Tracking: Track usage over sliding windows to ensure limits are not exceeded over time.
Burst Handling: Allow idle clients to "burst" above their limit for a short duration, while ensuring consistently high-usage clients are strictly capped.
Action on Limit: When a client exceeds their limit, the system should either drop packets/requests or queue them to be released when capacity becomes available.
Low Latency & High Scale: The enforcement mechanism must add minimal overhead (< 10ms) and handle millions of requests per second.
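The per-client quota, burst, and action-on-limit requirements above map naturally onto a token bucket. The sketch below is a minimal, single-node illustration (class and parameter names are my own, not from the problem statement): tokens refill at the sustained rate, the bucket capacity bounds the burst an idle client can accumulate, and a failed `allow()` is the point where the caller would drop or queue the request.

```python
import time

class TokenBucket:
    """Per-client token bucket: refills at `rate` tokens/sec,
    `capacity` bounds how far an idle client can burst."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # idle clients start full, enabling bursts
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Lazily refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                    # caller drops or queues the request
```

A consistently busy client can never exceed `rate` over time, while a client that was idle can momentarily spend up to `capacity` tokens, which is exactly the burst behavior the requirements call for.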
Common Interviewer Constraints & Follow-ups
Databricks interviewers often push the discussion into the following technical areas:
Algorithm Choice: Compare the trade-offs between Token Bucket (good for bursts), Leaky Bucket (smooths flow to a constant rate), Sliding Window Log (precise, but memory-heavy since every request timestamp is stored), and Sliding Window Counter (approximate, but cheap in memory and CPU).
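To make the sliding-window trade-off concrete, here is a hedged sketch of the sliding window counter variant (names are illustrative): instead of logging every request, it keeps only two fixed-window counts and estimates the sliding total by weighting the previous window by how much it still overlaps the current sliding interval.

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weights the previous fixed
    window's count by its remaining overlap with the sliding window."""

    def __init__(self, limit: int, window_secs: float):
        self.limit = limit
        self.window = window_secs
        self.curr_start = time.monotonic()
        self.curr_count = 0
        self.prev_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # Roll windows forward; if more than one full window has
            # passed, the previous window is effectively empty.
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start += (elapsed // self.window) * self.window
            elapsed = now - self.curr_start
        weight = 1.0 - elapsed / self.window
        estimated = self.prev_count * weight + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

This stores O(1) state per client versus O(requests) for a sliding window log, which is why the counter variant is usually preferred at millions of requests per second, at the cost of a bounded estimation error near window boundaries.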
Distributed Enforcement: How do you maintain an accurate global count across a distributed fleet of servers? (e.g., using a centralized cache like Redis vs. local in-memory counting with eventual consistency).
Concurrency: How do you handle race conditions when multiple threads or machines update the same client's usage counter simultaneously? (e.g., atomic increments, locks, or Lua scripts).
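The race condition here is the classic lost update: two threads both read the same usage value, both see headroom, and both admit a request that together exceed the quota. The single-process sketch below (class names are my own) shows the fix in miniature, making the read-check-update sequence one indivisible step under a lock; in a distributed deployment the same effect comes from Redis's atomic `INCR` or a Lua script executed atomically on the server.

```python
import threading

class AtomicUsageCounter:
    """Single-process analogue of an atomic check-and-increment:
    the lock makes read-check-update indivisible, so concurrent
    threads cannot jointly overshoot the limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def try_consume(self, amount: int = 1) -> bool:
        with self._lock:
            if self.used + amount > self.limit:
                return False
            self.used += amount
            return True

counter = AtomicUsageCounter(limit=1000)
# 4 workers x 300 attempts = 1200 requests against a limit of 1000.
threads = [
    threading.Thread(target=lambda: [counter.try_consume() for _ in range(300)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, interleaved threads could each pass the limit check before either increments, admitting more than 1000 requests; with it, exactly 1000 succeed regardless of interleaving.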
Fault Tolerance: If the centralized rate-limiting service or cache goes down, should the system "fail open" (allow all traffic) or "fail closed" (block all traffic)?
Databricks Specifics: How would this system integrate with a Lakehouse architecture or protect specific RatelimitGroups (RLG) such as API endpoints?