System Design - Design a Network Throttling System
Category: System Design — first seen: 2026-03-13
In a typical Databricks Software Engineer interview, the Network Throttling System is a core system design problem used to evaluate a candidate's ability to handle high-throughput, low-latency, and multi-tenant constraints.
Problem Statement: Network Throttling System
The Scenario: Databricks operates a massive multi-tenant platform where thousands of clients execute data processing jobs and API calls simultaneously. To ensure system reliability and prevent a single "noisy neighbor" from saturating the shared network bandwidth, you must design a Network Throttling System that regulates data flow per client.
Core Requirements:
Per-Client Quotas: Each client (e.g., a specific workspace or user ID) has a configured maximum throughput (bandwidth).
Throughput Tracking: Track usage over sliding windows to ensure limits are not exceeded over time.
Burst Handling: Allow idle clients to "burst" above their limit for a short duration, while ensuring consistently high-usage clients are strictly capped.
Action on Limit: When a client exceeds their limit, the system should either drop packets/requests or queue them to be released when capacity becomes available.
Low Latency & High Scale: The enforcement mechanism must add minimal overhead (< 10ms) and handle millions of requests per second.
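The per-client quota, burst, and action-on-limit requirements above map naturally onto a token bucket. The sketch below is a minimal, single-node illustration (class and parameter names are my own, not from the problem statement): tokens refill at the sustained rate, the bucket capacity bounds the burst an idle client can accumulate, and a failed `allow()` is the point where the caller would drop or queue the request.

```python
import time

class TokenBucket:
    """Per-client token bucket: refills at `rate` tokens/sec,
    `capacity` bounds how far an idle client can burst."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # idle clients start full, enabling bursts
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Lazily refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                    # caller drops or queues the request
```

A consistently busy client can never exceed `rate` over time, while a client that was idle can momentarily spend up to `capacity` tokens, which is exactly the burst behavior the requirements call for.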
Common Interviewer Constraints & Follow-ups
Databricks interviewers often push the discussion into the following technical areas:
Algorithm Choice: Compare the trade-offs between Token Bucket (good for bursts), Leaky Bucket (smooths flow to a constant rate), Sliding Window Log (precise, but memory-heavy since every request timestamp is stored), and Sliding Window Counter (approximate, but cheap in memory and CPU).
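To make the sliding-window trade-off concrete, here is a hedged sketch of the sliding window counter variant (names are illustrative): instead of logging every request, it keeps only two fixed-window counts and estimates the sliding total by weighting the previous window by how much it still overlaps the current sliding interval.

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weights the previous fixed
    window's count by its remaining overlap with the sliding window."""

    def __init__(self, limit: int, window_secs: float):
        self.limit = limit
        self.window = window_secs
        self.curr_start = time.monotonic()
        self.curr_count = 0
        self.prev_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # Roll windows forward; if more than one full window has
            # passed, the previous window is effectively empty.
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start += (elapsed // self.window) * self.window
            elapsed = now - self.curr_start
        weight = 1.0 - elapsed / self.window
        estimated = self.prev_count * weight + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

This stores O(1) state per client versus O(requests) for a sliding window log, which is why the counter variant is usually preferred at millions of requests per second, at the cost of a bounded estimation error near window boundaries.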
Distributed Enforcement: How do you maintain an accurate global count across a distributed fleet of servers? (e.g., using a centralized cache like Redis vs. local in-memory counting with eventual consistency).
Concurrency: How do you handle race conditions when multiple threads or machines update the same client's usage counter simultaneously? (e.g., atomic increments, locks, or Lua scripts).
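The race condition here is the classic lost update: two threads both read the same usage value, both see headroom, and both admit a request that together exceed the quota. The single-process sketch below (class names are my own) shows the fix in miniature, making the read-check-update sequence one indivisible step under a lock; in a distributed deployment the same effect comes from Redis's atomic `INCR` or a Lua script executed atomically on the server.

```python
import threading

class AtomicUsageCounter:
    """Single-process analogue of an atomic check-and-increment:
    the lock makes read-check-update indivisible, so concurrent
    threads cannot jointly overshoot the limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def try_consume(self, amount: int = 1) -> bool:
        with self._lock:
            if self.used + amount > self.limit:
                return False
            self.used += amount
            return True

counter = AtomicUsageCounter(limit=1000)
# 4 workers x 300 attempts = 1200 requests against a limit of 1000.
threads = [
    threading.Thread(target=lambda: [counter.try_consume() for _ in range(300)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, interleaved threads could each pass the limit check before either increments, admitting more than 1000 requests; with it, exactly 1000 succeed regardless of interleaving.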
Fault Tolerance: If the centralized rate-limiting service or cache goes down, should the system "fail open" (allow all traffic) or "fail closed" (block all traffic)?
Databricks Specifics: How would this system integrate with a Lakehouse architecture or protect specific RatelimitGroups (RLG) such as API endpoints?