Design a Logger System

System Design | Must

Problem Statement

Design a distributed notification system that sends alerts to users across multiple channels (email, SMS, push) while enforcing per-user rate limits to prevent notification fatigue. The system must handle millions of users generating billions of notification events daily, with configurable rate limits per channel (e.g., maximum 5 emails per hour, 10 push notifications per day). The core challenge is maintaining accurate rate limit state across a distributed fleet of servers while ensuring high availability and low latency. Users expect notifications to be delivered within seconds when limits allow, and the system must gracefully suppress notifications that exceed configured thresholds without losing track of what was sent. Consider how to handle clock skew, partial failures, and the tradeoff between strict accuracy and operational simplicity.

Key Requirements

Functional

Rate limit enforcement -- System must track notification counts per user per channel within sliding time windows and reject notifications that exceed configured thresholds
Multi-channel delivery -- Support sending notifications through email, SMS, and push with independent rate limits for each channel type
Configurable policies -- Allow administrators to define rate limit rules per user segment or tenant with different time windows (per minute, hour, day)
Audit and visibility -- Provide APIs to query current rate limit status for a user and retrieve history of sent and suppressed notifications

Non-Functional

Scalability -- Handle 100,000 notification requests per second with millions of active users, supporting horizontal scaling of both rate limiting and delivery components
Reliability -- Ensure at-least-once delivery for notifications that pass rate limits, with less than 0.1% false rejections due to system errors
Latency -- Make rate limit decisions within 50 ms at p99 and deliver notifications within 5 seconds of approval
Consistency -- Accept eventual consistency for rate limit counters with bounded staleness (tolerate 1-2% over-delivery during brief windows)
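
A quick back-of-envelope check helps confirm that these numbers fit together. The sketch below takes the 100,000 requests per second figure from the scalability requirement and treats it as a sustained rate; the user count, per-counter size, and the 2% tolerance applied to a 5-per-hour limit are assumptions chosen only for illustration.

```python
# Back-of-envelope sizing. Only REQUESTS_PER_SECOND comes from the
# requirements; every other constant is an illustrative assumption.
REQUESTS_PER_SECOND = 100_000
SECONDS_PER_DAY = 24 * 60 * 60

# Daily volume if the stated rate were sustained all day.
requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY

ACTIVE_USERS = 10_000_000   # assumed; the text only says "millions"
CHANNELS = 3                # email, SMS, push
BYTES_PER_COUNTER = 100     # assumed key + value + cache overhead

# One counter per (user, channel) pair for a single time window.
counter_keys = ACTIVE_USERS * CHANNELS
counter_memory_gb = counter_keys * BYTES_PER_COUNTER / 1e9

print(f"{requests_per_day:,} requests/day")
print(f"{counter_keys:,} counters, about {counter_memory_gb:.1f} GB")
```

Under these assumptions the sustained rate works out to 8.64 billion requests per day, consistent with "billions of notification events daily", and the counter state for one window is small enough to hold in memory on a modest cache cluster.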

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Distributed Rate Limiting Architecture

The core technical challenge is maintaining accurate rate limit counters when multiple servers process requests for the same user concurrently. Interviewers want to see if you understand the tradeoffs between strong consistency, latency, and operational complexity.

Hints to consider:

Consider using a centralized cache layer with atomic increment operations and TTL-based expiration for time windows
Discuss the tradeoffs between fixed windows (simpler but bursty) versus sliding windows (more accurate but complex to implement)
Explain how to handle cache failures or network partitions -- should you fail open (allow notifications) or fail closed (reject)?
Address the thundering herd problem when many servers check limits for popular users simultaneously
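
The first and third hints can be sketched together as a fixed-window counter. This is a minimal illustration, not a production design: `InMemoryCounterStore` is a hypothetical in-process stand-in for a shared cache that offers atomic increment with a TTL, and the key layout, the `fail_open` flag, and the use of `ConnectionError` to signal a cache outage are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class InMemoryCounterStore:
    """Hypothetical stand-in for a shared cache with atomic increment + TTL."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[int, float]] = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key: str, ttl_seconds: float) -> int:
        now = self._clock()
        count, expires_at = self._data.get(key, (0, now + ttl_seconds))
        if expires_at <= now:  # entry expired: start a fresh counter
            count, expires_at = 0, now + ttl_seconds
        count += 1
        self._data[key] = (count, expires_at)
        return count


def allow_fixed_window(store, user_id: str, channel: str, limit: int,
                       window_seconds: int, now: float,
                       fail_open: bool = True) -> bool:
    """Return True if this notification fits in the current fixed window."""
    # The window index is part of the key, so each window gets its own counter
    # and old counters simply expire instead of needing an explicit reset.
    window_index = int(now // window_seconds)
    key = f"rl:{user_id}:{channel}:{window_index}"
    try:
        count = store.incr_with_ttl(key, window_seconds * 2)
    except ConnectionError:
        # Cache unreachable: the policy decision is explicit, not accidental.
        return fail_open
    return count <= limit
```

A short usage example with a controllable clock (5 emails per hour, as in the problem statement):

```python
fake_now = [0.0]
store = InMemoryCounterStore(clock=lambda: fake_now[0])
decisions = [allow_fixed_window(store, "u1", "email", 5, 3600, fake_now[0])
             for _ in range(6)]
# The first five are allowed and the sixth is rejected; a request in the
# next hour lands on a new key and is allowed again.
```

Failing open favors availability at the cost of possible over-delivery, which matches the tolerance stated in the consistency requirement; failing closed would be the safer default for a channel where each send costs money, such as SMS.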

2. Decoupling Request Validation from Delivery

Interviewers expect you to separate the synchronous rate limit check from asynchronous notification delivery to prevent downstream failures from blocking the API. This tests your understanding of queue-based architectures and idempotency.

Hints to consider:

Use a message queue to buffer approved notifications between the API tier and delivery workers
Design idempotency keys so that retries don't double-send notifications or double-count against rate limits
Discuss how to handle delivery failures -- should retry attempts count against rate limits or only successful sends?
Explain monitoring and alerting for queue depth and delivery lag to prevent notifications from becoming stale
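
One way to make the idempotency hint concrete is to record the decision for each idempotency key at the API tier and to deduplicate again in the delivery worker. The sketch below is a simplified, single-process illustration: `NotificationGateway`, `DeliveryWorker`, and the message fields are hypothetical names, the in-memory `deque` stands in for a real message queue, and time windows are left out to keep the focus on retries.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


class NotificationGateway:
    """API tier: synchronous rate limit check, then enqueue for delivery."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: Dict[Tuple[str, str], int] = {}
        self.decisions: Dict[str, bool] = {}  # idempotency key -> first decision
        self.queue: Deque[dict] = deque()     # stand-in for a message queue

    def submit(self, idempotency_key: str, user_id: str,
               channel: str, body: str) -> bool:
        # A retried request gets the original answer and is not counted again.
        if idempotency_key in self.decisions:
            return self.decisions[idempotency_key]
        bucket = (user_id, channel)
        allowed = self.counts.get(bucket, 0) < self.limit
        if allowed:
            self.counts[bucket] = self.counts.get(bucket, 0) + 1
            self.queue.append({"idempotency_key": idempotency_key,
                               "user_id": user_id, "channel": channel,
                               "body": body})
        self.decisions[idempotency_key] = allowed
        return allowed


class DeliveryWorker:
    """Delivery tier: at-least-once queues can redeliver, so dedupe here too."""

    def __init__(self) -> None:
        self.delivered: Set[str] = set()
        self.sent: List[dict] = []  # stand-in for calls to a channel provider

    def handle(self, message: dict) -> bool:
        key = message["idempotency_key"]
        if key in self.delivered:
            return False            # duplicate delivery attempt, skip the send
        self.sent.append(message)
        self.delivered.add(key)
        return True
```

In this sketch the counter is incremented at approval time, so delivery retries never consume extra budget. Counting only successful sends instead would require the worker to report back to the rate limiter, which is the tradeoff the third hint asks about.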

3. Sliding Window Implementation

Most candidates start with simple fixed windows (count events in the current hour) but interviewers push for sliding windows that provide smoother rate limiting. This reveals whether you can optimize storage and computation for high-volume scenarios.

Hints to consider:

Consider storing individual event timestamps in a sorted set with O(log N) insertion and range queries
Discuss approximation techniques like the sliding window counter pattern or token bucket algorithm to reduce storage compared with keeping a full log of timestamps
Explain how to efficiently expire old events and clean up memory without full scans
Address the precision tradeoff -- is it acceptable to use 1-minute buckets within an hourly window?
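
The difference between an exact log and a bucketed approximation is easier to discuss with both in front of you. The sketch below shows an exact sliding-window log and a two-bucket sliding-window counter for a single user and channel. Both classes are illustrative and assume timestamps arrive in non-decreasing order; a shared sorted set would be needed to tolerate out-of-order events from multiple servers.

```python
from collections import deque
from typing import Deque


class SlidingWindowLog:
    """Exact: keeps one timestamp per allowed event inside the window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.events: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        cutoff = now - self.window
        # Expire from the old end only, so cleanup never scans the whole log.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True


class SlidingWindowCounter:
    """Approximate: two counters, weighting the previous window by overlap."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.current_index = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.current_index:
            # Roll over; a gap of more than one window clears the history.
            adjacent = index == self.current_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index
        elapsed_fraction = (now % self.window) / self.window
        # Assume previous-window events were spread evenly across that window.
        estimated = (self.previous_count * (1 - elapsed_fraction)
                     + self.current_count)
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

The log stores up to `limit` timestamps per user and channel, while the counter stores two integers regardless of the limit. Using finer buckets, such as 1-minute buckets within an hourly window, sits between the two: more counters per key, and an error bounded by one bucket's worth of events.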

4. Multi-Tenancy and Isolation

Enterprise notification systems serve multiple applications or customer tenants with different rate limit policies and delivery priorities. Interviewers want to see how you enforce isolation and prevent noisy neighbors.

Hints to consider:

Partition rate limit state by tenant ID and use separate cache namespaces or database schemas
Implement per-tenant delivery queues to prevent one tenant's burst from delaying another's notifications
Discuss API authentication and authorization to ensure tenants can only query their own rate limit status
Consider quota management -- how do you enforce total notification volume per tenant across all their users?
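
Namespacing and tenant-level quotas can be sketched as two counters checked together: one per user and channel, one per tenant. The class names, key layout, and policy fields below are hypothetical, and window expiry is omitted so the isolation logic stays visible.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TenantPolicy:
    per_user_limit: Dict[str, int]  # channel -> max sends per user per window
    tenant_quota: int               # max sends per window across all users


class TenantRateLimiter:
    def __init__(self, policies: Dict[str, TenantPolicy]) -> None:
        self.policies = policies
        self.counters: Dict[str, int] = {}  # stand-in for a shared cache

    @staticmethod
    def user_key(tenant_id: str, user_id: str, channel: str) -> str:
        # The tenant prefix keeps identical user IDs in different tenants apart.
        return f"tenant:{tenant_id}:user:{user_id}:{channel}"

    @staticmethod
    def tenant_key(tenant_id: str) -> str:
        return f"tenant:{tenant_id}:total"

    def allow(self, tenant_id: str, user_id: str, channel: str) -> bool:
        policy = self.policies[tenant_id]
        ukey = self.user_key(tenant_id, user_id, channel)
        tkey = self.tenant_key(tenant_id)
        # Check both limits before incrementing either, so a rejection
        # never consumes budget at the other level.
        if self.counters.get(tkey, 0) >= policy.tenant_quota:
            return False
        if self.counters.get(ukey, 0) >= policy.per_user_limit[channel]:
            return False
        self.counters[ukey] = self.counters.get(ukey, 0) + 1
        self.counters[tkey] = self.counters.get(tkey, 0) + 1
        return True
```

In a real distributed cache the check and the two increments would need to run atomically (for example in a server-side script or transaction); otherwise concurrent requests can overshoot either limit, which may or may not be acceptable under the stated 1-2% over-delivery tolerance.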
