Design a Logger System
Problem Statement
Design a centralized logging system that collects logs from thousands of application instances, deduplicates repeated messages within a time window, stores them durably, and provides fast search. A twist: a message should only be printed/stored if the same message has not been seen in the last 10 seconds.
This tests your ability to design a high-throughput write pipeline with real-time deduplication, multi-tenant isolation (different apps share the system), and a search layer that handles queries over massive volumes of time-series data. You need to balance ingestion speed, storage cost, and query latency.
Key Requirements
Functional
- Log ingestion with dedup -- accept log messages with timestamps; suppress duplicates that appeared within the last 10 seconds
- Multi-tenant scoping -- logs are isolated by application; each app has its own quotas, access controls, and search namespace
- Search and filter -- users search logs by time range, application, severity level, and free-text keywords with fast results
- Live tail -- users can stream recent logs in near real time for a given application or filter
Non-Functional
- Scalability -- ingest 100K+ log messages per second across all tenants
- Latency -- dedup decision in under 10ms; search results in under 500ms for recent data
- Durability -- no log loss once accepted; at-least-once delivery from producers
- Cost efficiency -- older logs should be stored cheaply; only recent logs need fast access
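A quick back-of-envelope calculation helps ground these targets. The message size and retention period below are illustrative assumptions, not figures from the problem statement:

```python
# Rough capacity estimate for the targets above.
# Assumptions (not given in the problem): 500 bytes per message
# including metadata, 30-day hot retention, sustained peak rate.
MSGS_PER_SEC = 100_000
AVG_MSG_BYTES = 500
SECONDS_PER_DAY = 86_400

bytes_per_day = MSGS_PER_SEC * AVG_MSG_BYTES * SECONDS_PER_DAY
tb_per_day = bytes_per_day / 10**12

print(f"{tb_per_day:.1f} TB/day raw")                      # ~4.3 TB/day
print(f"{tb_per_day * 30:.0f} TB at 30-day hot retention")  # ~130 TB
```

Numbers like these motivate the hot/warm/cold tiering discussed later: keeping 130 TB on fast SSD-backed search nodes is expensive, while most of it is rarely queried.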
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Time-Window Deduplication at Scale
The 10-second suppression window means you need to remember recently seen messages and check them on every incoming log — across multiple instances processing in parallel.
Hints to consider:
- Define the dedup key clearly: app_id + normalized message text (strip variable parts like timestamps or request IDs)
- Use Redis SET with the NX and EX flags (SET key 1 NX EX 10) for an atomic check-and-set: if the key already exists, suppress; if not, set it with the 10-second TTL and allow. A bare SETNX followed by a separate EXPIRE is not atomic -- a crash between the two calls leaves a key that never expires
- This must be a shared store (not in-process memory) so all ingestion instances see the same state
- Discuss what happens at scale: hot messages create hot keys; consider sharding by hash of the dedup key
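The hints above can be sketched in-process. This is a minimal illustration of the check-and-set semantics (mirroring Redis SET NX EX 10) plus key normalization; the regex patterns and function names are assumptions for the example, and production would use a shared Redis instance so all ingesters see the same state:

```python
import re
import time
from typing import Dict, Optional

# In-process sketch only -- a real deployment replaces _seen with a shared
# Redis store (SET key 1 NX EX 10) so every ingestion instance agrees.
WINDOW_SECONDS = 10
_seen: Dict[str, float] = {}  # dedup key -> expiry timestamp

def normalize(message: str) -> str:
    """Strip variable parts (hex ids, numbers) so repeats produce the same key."""
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message)  # request ids, hashes
    return re.sub(r"\d+", "<n>", message)                    # counts, timestamps

def should_emit(app_id: str, message: str, now: Optional[float] = None) -> bool:
    """Return True only if this message has not been seen in the last 10s."""
    now = time.time() if now is None else now
    key = f"{app_id}:{normalize(message)}"
    expiry = _seen.get(key)
    if expiry is not None and expiry > now:
        return False                        # duplicate inside the window
    _seen[key] = now + WINDOW_SECONDS       # equivalent of SET ... NX EX 10
    return True

assert should_emit("app1", "timeout after 30 ms", now=0.0)
assert not should_emit("app1", "timeout after 31 ms", now=5.0)  # same normalized key
assert should_emit("app1", "timeout after 30 ms", now=11.0)     # window expired
```

Note how "timeout after 30 ms" and "timeout after 31 ms" normalize to the same key, which is exactly the behavior you want for error storms that differ only in a variable field.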
2. High-Throughput Ingestion Pipeline
Logs arrive in massive bursts (deployment, error storm). The ingestion path must absorb spikes without dropping data or blocking producers.
Hints to consider:
- Decouple producers from consumers using a durable buffer (Kafka) between the API and the processing pipeline
- Producers batch messages client-side before sending to reduce network overhead
- Use per-partition ordering in Kafka (keyed by app_id) to maintain per-tenant ordering where needed
- Implement backpressure: if the pipeline falls behind, apply rate limiting at the API gateway rather than dropping messages silently
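Client-side batching from the hints above can be sketched as a small buffer that flushes on either a size or an age threshold. The class name, thresholds, and callback are illustrative assumptions, not a specific Kafka client API:

```python
import time
from typing import Callable, List, Optional

class LogBatcher:
    """Buffer log messages client-side; flush when the batch is full or stale."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_batch: int = 100, max_age_s: float = 0.5):
        self.flush_fn = flush
        self.max_batch = max_batch   # flush at this many buffered messages...
        self.max_age_s = max_age_s   # ...or when the oldest message is this old
        self.buf: List[str] = []
        self.oldest: float = 0.0

    def add(self, message: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if not self.buf:
            self.oldest = now
        self.buf.append(message)
        if len(self.buf) >= self.max_batch or now - self.oldest >= self.max_age_s:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.flush_fn(self.buf)   # e.g. one Kafka produce call, keyed by app_id
            self.buf = []

batches: List[List[str]] = []
b = LogBatcher(batches.append, max_batch=3, max_age_s=1.0)
b.add("a", now=0.0); b.add("b", now=0.1); b.add("c", now=0.2)  # size-triggered flush
b.add("d", now=0.3); b.add("e", now=2.0)                        # age-triggered flush
assert batches == [["a", "b", "c"], ["d", "e"]]
```

The age threshold bounds the latency a batched message can accumulate, so batching improves throughput without compromising the live-tail experience.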
3. Search Architecture for Log Data
Users need to find specific log entries across billions of records. Full table scans are not viable.
Hints to consider:
- Use a search engine (Elasticsearch) with time-based indices: one index per day or per week, so old indices can be closed/archived
- Shard by app_id or tenant so queries are scoped to a subset of shards
- For recent logs (last hour), keep a hot cache in Redis or use a dedicated Elasticsearch index with more replicas
- Discuss index lifecycle management: hot → warm → cold → frozen → delete, based on age
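Time-based indices pay off at query time because a search only fans out to the indices its time range covers. A minimal sketch, assuming a `logs-<tenant>-YYYY.MM.DD` daily naming convention (the pattern is an assumption for illustration, not a fixed Elasticsearch requirement):

```python
from datetime import date, timedelta
from typing import List

def indices_for_range(tenant: str, start: date, end: date) -> List[str]:
    """Return the daily index names a query over [start, end] must touch."""
    names = []
    d = start
    while d <= end:
        names.append(f"logs-{tenant}-{d:%Y.%m.%d}")  # assumed naming convention
        d += timedelta(days=1)
    return names

# A 3-day query fans out to exactly 3 indices instead of the whole cluster.
print(indices_for_range("app1", date(2024, 1, 30), date(2024, 2, 1)))
# ['logs-app1-2024.01.30', 'logs-app1-2024.01.31', 'logs-app1-2024.02.01']
```

Embedding the tenant in the index name also gives multi-tenant isolation for free: access control and quota enforcement can be applied per index pattern, and closing or deleting a day's index is a cheap metadata operation rather than a bulk delete.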