Design a Logger System
Problem Statement
Design a distributed notification system that sends alerts to users across multiple channels (email, SMS, push) while enforcing per-user rate limits to prevent notification fatigue. The system must handle millions of users generating billions of notification events daily, with configurable rate limits per channel (e.g., maximum 5 emails per hour, 10 push notifications per day). The core challenge is maintaining accurate rate limit state across a distributed fleet of servers while ensuring high availability and low latency. Users expect notifications to be delivered within seconds when limits allow, and the system must gracefully suppress notifications that exceed configured thresholds without losing track of what was sent. Consider how to handle clock skew, partial failures, and the tradeoff between strict accuracy and operational simplicity.
Key Requirements
Functional
- Rate limit enforcement -- System must track notification counts per user per channel within sliding time windows and reject notifications that exceed configured thresholds
- Multi-channel delivery -- Support sending notifications through email, SMS, and push with independent rate limits for each channel type
- Configurable policies -- Allow administrators to define rate limit rules per user segment or tenant with different time windows (per minute, hour, day)
- Audit and visibility -- Provide APIs to query current rate limit status for a user and retrieve history of sent and suppressed notifications
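The configurable policies above could be modeled roughly as follows. This is a minimal sketch, not a prescribed schema: the names `Channel`, `RateLimitPolicy`, `DEFAULT_POLICIES`, and `resolve_policy` are hypothetical, and the default values simply mirror the examples in the problem statement (5 emails per hour, 10 push notifications per day).

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    EMAIL = "email"
    SMS = "sms"
    PUSH = "push"


@dataclass(frozen=True)
class RateLimitPolicy:
    channel: Channel
    max_count: int        # notifications allowed per window
    window_seconds: int   # window length, e.g. 60, 3600, 86400


# Hypothetical defaults mirroring the examples in the problem statement.
DEFAULT_POLICIES = {
    Channel.EMAIL: RateLimitPolicy(Channel.EMAIL, max_count=5, window_seconds=3600),
    Channel.PUSH: RateLimitPolicy(Channel.PUSH, max_count=10, window_seconds=86400),
}


def resolve_policy(channel, tenant_overrides=None):
    """Return the tenant's override for a channel if present, else the default (or None)."""
    if tenant_overrides and channel in tenant_overrides:
        return tenant_overrides[channel]
    return DEFAULT_POLICIES.get(channel)
```

Keeping each channel's policy as its own record is one way to get independent limits per channel; a tenant or user segment then only needs to store the entries it overrides.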
Non-Functional
- Scalability -- Handle 100,000 notification requests per second with millions of active users, supporting horizontal scaling of both rate limiting and delivery components
- Reliability -- Ensure at-least-once delivery for notifications that pass rate limits, with less than 0.1% false rejections due to system errors
- Latency -- Make rate limit decisions within 50ms p99 and deliver notifications within 5 seconds of approval
- Consistency -- Accept eventual consistency for rate limit counters with bounded staleness (tolerate 1-2% over-delivery during brief windows)
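A quick back-of-envelope check helps show that these numbers hang together. The request rate comes from the scalability requirement; the user count and bytes-per-counter figures below are assumptions chosen only for illustration, and the estimate covers one simple counter per user per channel.

```python
REQUESTS_PER_SECOND = 100_000      # from the scalability requirement
SECONDS_PER_DAY = 24 * 60 * 60

ACTIVE_USERS = 10_000_000          # assumption: "millions of active users"
CHANNELS = 3                       # email, SMS, push
BYTES_PER_COUNTER = 100            # assumption: key + value + cache overhead

requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY
counter_keys = ACTIVE_USERS * CHANNELS
counter_memory_gb = counter_keys * BYTES_PER_COUNTER / 1e9

print(f"{requests_per_day:,} requests/day")          # 8,640,000,000 requests/day
print(f"{counter_keys:,} counter keys")              # 30,000,000 counter keys
print(f"{counter_memory_gb:.1f} GB of counter state")  # 3.0 GB of counter state
```

At 100,000 requests per second the daily volume is about 8.6 billion, consistent with "billions of notification events daily". Under these assumptions the counter state is small enough to fit in a modest cache cluster, so the harder constraint is request throughput rather than memory.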
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Distributed Rate Limiting Architecture
The core technical challenge is maintaining accurate rate limit counters when multiple servers process requests for the same user concurrently. Interviewers want to see if you understand the tradeoffs between strong consistency, latency, and operational complexity.
Hints to consider:
- Consider using a centralized cache layer with atomic increment operations and TTL-based expiration for time windows
- Discuss the tradeoffs between fixed windows (simpler but bursty) versus sliding windows (more accurate but complex to implement)
- Explain how to handle cache failures or network partitions -- should you fail open (allow notifications) or fail closed (reject)?
- Address the thundering herd (hot key) problem when many servers check limits for the same popular users simultaneously
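The first and third hints can be sketched together as a fixed-window counter. This is an illustration under stated assumptions: `InMemoryCounterStore` is a hypothetical in-process stand-in for a centralized cache with atomic increment and TTL, and the `rl:` key format, `allow` function, and `fail_open` flag are invented names.

```python
import threading
import time


class InMemoryCounterStore:
    """Hypothetical stand-in for a centralized cache with atomic increment and TTL."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key, ttl_seconds, now):
        with self._lock:  # the lock makes the read-modify-write atomic
            count, expires_at = self._data.get(key, (0, 0.0))
            if now >= expires_at:  # missing or expired: start a fresh counter
                count, expires_at = 0, now + ttl_seconds
            count += 1
            self._data[key] = (count, expires_at)
            return count


def allow(store, user_id, channel, limit, window_seconds, now=None, fail_open=True):
    """Fixed-window check: one counter per user, channel, and window start."""
    now = time.time() if now is None else now
    window_start = int(now // window_seconds) * window_seconds
    key = f"rl:{user_id}:{channel}:{window_start}"
    try:
        # Every attempt increments the counter, including ones that end up rejected.
        count = store.incr_with_ttl(key, window_seconds, now)
    except Exception:
        # Cache unreachable: fail open (allow) or fail closed (reject) is a policy choice.
        return fail_open
    return count <= limit
```

Because the increment and the read happen in one atomic step, two servers handling the same user concurrently cannot both observe the last free slot. The `fail_open` flag makes the partition behavior explicit: failing open risks over-delivery, while failing closed risks the false rejections that the reliability requirement caps at 0.1%.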
2. Decoupling Request Validation from Delivery
Interviewers expect you to separate the synchronous rate limit check from asynchronous notification delivery to prevent downstream failures from blocking the API. This tests your understanding of queue-based architectures and idempotency.
Hints to consider:
- Use a message queue to buffer approved notifications between the API tier and delivery workers
- Design idempotency keys so that retries don't double-send notifications or double-count against rate limits
- Discuss how to handle delivery failures -- should retry attempts count against rate limits or only successful sends?
- Explain monitoring and alerting for queue depth and delivery lag to prevent notifications from becoming stale
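The split between the synchronous check and asynchronous delivery might look like the sketch below. It uses the standard-library `queue.Queue` as a stand-in for a durable message queue and plain in-memory collections for idempotency state; `NotificationApi`, `drain`, and the message fields are hypothetical names, and a real system would persist both the queue and the key store.

```python
import queue


class NotificationApi:
    """Synchronous tier: dedupe, check the rate limit, enqueue, return."""

    def __init__(self, rate_limit_check, outbox):
        self._check = rate_limit_check   # callable(user_id, channel) -> bool
        self._outbox = outbox            # stand-in for a durable message queue
        self._decisions = {}             # idempotency_key -> earlier decision

    def submit(self, idempotency_key, user_id, channel, body):
        # A retry with the same key returns the first decision and does not
        # touch the rate limit counter or the queue again.
        if idempotency_key in self._decisions:
            return self._decisions[idempotency_key]
        decision = "queued" if self._check(user_id, channel) else "suppressed"
        if decision == "queued":
            self._outbox.put({"key": idempotency_key, "user_id": user_id,
                              "channel": channel, "body": body})
        self._decisions[idempotency_key] = decision
        return decision


def drain(outbox, send, delivered_keys):
    """Delivery worker: at-least-once consumption, deduplicated by key."""
    while True:
        try:
            message = outbox.get_nowait()
        except queue.Empty:
            return
        if message["key"] in delivered_keys:
            continue  # redelivered message: already sent once
        send(message)
        delivered_keys.add(message["key"])
```

In this sketch the rate limit is charged once at submission time, so delivery retries never count against the user. That is one possible answer to the third hint, not the only defensible one.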
3. Sliding Window Implementation
Most candidates start with simple fixed windows (count events in the current hour) but interviewers push for sliding windows that provide smoother rate limiting. This reveals whether you can optimize storage and computation for high-volume scenarios.
Hints to consider:
- Consider storing individual event timestamps in a sorted set with O(log N) insertion and range queries
- Discuss approximation techniques like the sliding window counter pattern or token bucket algorithm to reduce storage compared with a per-event sliding log
- Explain how to efficiently expire old events and clean up memory without full scans
- Address the precision tradeoff -- is it acceptable to use 1-minute buckets within an hourly window?
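The bucketing idea in the last hint can be sketched as an approximate sliding window that keeps one count per bucket instead of one timestamp per event. This is a single-process illustration with hypothetical names (`BucketedSlidingWindow`, `allow`); it assumes timestamps arrive in non-decreasing order and counts only accepted events.

```python
from collections import deque


class BucketedSlidingWindow:
    """Approximate sliding window: per-bucket counts instead of per-event timestamps."""

    def __init__(self, limit, window_seconds=3600, bucket_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.bucket_seconds = bucket_seconds
        self._buckets = deque()  # (bucket_start, count), oldest first
        self._total = 0

    def _evict(self, now):
        # Drop buckets that ended at or before the start of the window.
        cutoff = now - self.window_seconds
        while self._buckets and self._buckets[0][0] + self.bucket_seconds <= cutoff:
            _, count = self._buckets.popleft()
            self._total -= count

    def allow(self, now):
        self._evict(now)
        if self._total >= self.limit:
            return False
        bucket_start = int(now // self.bucket_seconds) * self.bucket_seconds
        if self._buckets and self._buckets[-1][0] == bucket_start:
            start, count = self._buckets.pop()
            self._buckets.append((start, count + 1))
        else:
            self._buckets.append((bucket_start, 1))
        self._total += 1
        return True
```

With 1-minute buckets in an hourly window, state per user and channel is bounded at roughly 60 small entries regardless of traffic, and expiry only pops from the front of the deque rather than scanning. The cost is precision: an old event can stay counted for up to one bucket length longer than in an exact sliding log, so the approximation errs toward suppressing rather than over-delivering.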
4. Multi-Tenancy and Isolation
Enterprise notification systems serve multiple applications or customer tenants with different rate limit policies and delivery priorities. Interviewers want to see how you enforce isolation and prevent noisy neighbors.
Hints to consider:
- Partition rate limit state by tenant ID and use separate cache namespaces or database schemas
- Implement per-tenant delivery queues to prevent one tenant's burst from delaying another's notifications
- Discuss API authentication and authorization to ensure tenants can only query their own rate limit status
- Consider quota management -- how do you enforce total notification volume per tenant across all their users?
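Tenant partitioning and tenant-wide quotas could be combined along these lines. This is a simplified in-memory sketch: `counter_key`, `TenantQuotaLimiter`, and the key format are hypothetical, fixed windows are used for brevity, and counters for old windows are never cleaned up here.

```python
from collections import defaultdict


def counter_key(tenant_id, user_id, channel, window_start):
    """Tenant-prefixed key so tenants never share counter state."""
    return f"rl:{tenant_id}:{user_id}:{channel}:{window_start}"


class TenantQuotaLimiter:
    """Checks a tenant-wide quota first, then the per-user limit."""

    def __init__(self, tenant_quota, user_limit, window_seconds):
        self.tenant_quota = tenant_quota
        self.user_limit = user_limit
        self.window_seconds = window_seconds
        self._tenant_counts = defaultdict(int)  # (tenant_id, window_start) -> count
        self._user_counts = defaultdict(int)    # counter_key(...) -> count

    def allow(self, tenant_id, user_id, channel, now):
        window_start = int(now // self.window_seconds) * self.window_seconds
        tenant_key = (tenant_id, window_start)
        user_key = counter_key(tenant_id, user_id, channel, window_start)
        if self._tenant_counts[tenant_key] >= self.tenant_quota:
            return False
        if self._user_counts[user_key] >= self.user_limit:
            return False
        # Count only accepted notifications against both budgets.
        self._tenant_counts[tenant_key] += 1
        self._user_counts[user_key] += 1
        return True
```

Prefixing every key with the tenant ID gives each tenant its own namespace, and the tenant-level counter caps total volume across all of that tenant's users. In a distributed deployment both counters would live in the shared cache and would need to be updated atomically together, which this sketch does not attempt.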