Design a Logger System
System Design · Must
Problem Statement
Design a distributed logging system that receives log messages from thousands of application servers and services, indexes them for fast search, and allows operators to query logs by time range, severity, service name, and keywords. The system must handle tens of millions of log entries per second, make logs searchable within seconds of emission, and retain them for compliance and debugging purposes.
At Amazon, interviewers ask this to evaluate your understanding of high-throughput write pipelines, text indexing at scale, and the balance between ingestion speed, query latency, and storage cost. Think of systems like Splunk, the ELK stack, or CloudWatch Logs. The challenge is building a pipeline that absorbs massive write volumes without dropping data while making that data searchable in near real-time.
Key Requirements
Functional
- Log ingestion -- accept log streams from hundreds of thousands of sources via multiple protocols (HTTP, syslog, agent-based collectors)
- Full-text search -- operators search logs using keywords, field-based filters (service, severity, host), and time range constraints
- Aggregation and analytics -- support count, group-by, and percentile queries for dashboards and trend analysis
- Pattern-based alerting -- operators define rules that trigger alerts when specific log patterns appear or error rates exceed thresholds
Non-Functional
- Scalability -- handle 50+ million log events per second ingestion rate with horizontal scaling
- Reliability -- ensure no log data loss during ingestion spikes or component failures with at-least-once delivery
- Latency -- logs searchable within 10 seconds of emission; query response under 2 seconds at p95
- Consistency -- eventual consistency for search; strict ordering within each log source stream
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Backpressure
Log traffic spikes dramatically during deployments, incidents, and batch jobs. Interviewers want to see a pipeline that absorbs 10x normal load gracefully.
Hints to consider:
- Use Kafka as a durable buffer between log producers and indexing consumers, providing ordering per source and replay capability
- Deploy lightweight collection agents that batch and compress logs before sending to reduce network overhead
- Implement per-tenant rate limiting and admission control to prevent noisy neighbors from impacting other teams
- Design a load balancer with health checks that routes traffic away from struggling ingestion nodes
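The per-tenant admission control idea above can be sketched as a token-bucket limiter in front of the ingestion tier. This is a minimal illustration, not a production design; the default rates and the in-memory bucket map are assumptions (a real deployment would share limits across ingestion nodes).

```python
import time


class TokenBucket:
    """Token bucket: tokens refill at a steady rate up to a burst capacity.
    Each admitted log event (or batch) spends tokens."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class AdmissionController:
    """One bucket per tenant so a noisy neighbor only throttles itself.
    Default limits are illustrative placeholders."""

    def __init__(self, default_rate: float = 10_000.0, default_burst: float = 20_000.0):
        self.buckets: dict[str, TokenBucket] = {}
        self.default = (default_rate, default_burst)

    def admit(self, tenant: str, batch_size: int) -> bool:
        bucket = self.buckets.setdefault(tenant, TokenBucket(*self.default))
        return bucket.allow(batch_size)
```

Rejected batches would be answered with a retry-after signal so agents back off rather than drop logs outright.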
2. Indexing Strategy for Fast Full-Text Search
Making unstructured text searchable at petabyte scale is the core technical challenge. Interviewers expect you to discuss inverted indexes, sharding, and the freshness/performance tradeoff.
Hints to consider:
- Partition indexes by time (e.g., hourly or daily) to enable efficient pruning of old data and faster queries on recent logs
- Build inverted indexes mapping terms to log entry IDs with positional information for phrase queries
- Use field extraction at index time for structured fields (severity, service, host) to enable fast filtered queries
- Consider bloom filters to quickly skip index partitions that definitely do not contain search terms
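The partitioning and pruning ideas above can be combined in a toy sketch: an hourly-bucketed inverted index where each partition keeps a cheap term set (standing in for a Bloom filter) that lets queries skip partitions that cannot match. Tokenization and the bucket size are simplifying assumptions.

```python
from collections import defaultdict


class TimePartitionedIndex:
    """Inverted index partitioned by time bucket. A per-partition term set
    plays the role of a Bloom filter for partition pruning."""

    def __init__(self, bucket_seconds: int = 3600):
        self.bucket_seconds = bucket_seconds
        self.partitions: dict[int, dict[str, set]] = {}  # bucket -> term -> log IDs
        self.term_sets: dict[int, set] = {}              # bucket -> terms seen

    def index(self, log_id: int, timestamp: int, message: str) -> None:
        bucket = timestamp // self.bucket_seconds
        inv = self.partitions.setdefault(bucket, defaultdict(set))
        terms = self.term_sets.setdefault(bucket, set())
        # Naive whitespace tokenization; real systems use smarter analyzers.
        for term in message.lower().split():
            inv[term].add(log_id)
            terms.add(term)

    def search(self, term: str, start: int, end: int) -> set:
        term = term.lower()
        results: set = set()
        for bucket in range(start // self.bucket_seconds, end // self.bucket_seconds + 1):
            if term not in self.term_sets.get(bucket, ()):  # prune partition
                continue
            results |= self.partitions[bucket].get(term, set())
        return results
```

Time partitioning also makes retention trivial: expiring an hour of logs is a partition drop, not a mass delete.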
3. Data Retention, Tiering, and Cost
At tens of millions of events per second, daily ingest runs to hundreds of terabytes, and raw storage costs quickly become prohibitive. Interviewers want lifecycle management strategies.
Hints to consider:
- Implement hot-warm-cold tiering: recent logs on fast SSD-backed storage, older data on cheaper storage, archived to object storage
- Apply aggressive compression since logs are write-once and highly compressible (10-20x typical compression ratios)
- Support per-team retention policies and quotas to control costs in multi-tenant deployments
- Use downsampling for very old data: keep only aggregated statistics while dropping raw entries past the retention window
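The hot-warm-cold lifecycle above reduces to a simple age-to-tier mapping applied to each time partition. The thresholds below are illustrative defaults, not prescriptions; real policies would be per-team and quota-aware.

```python
from dataclasses import dataclass


@dataclass
class TierPolicy:
    """Age thresholds (in days) for each storage tier; values are examples."""
    hot_days: int = 3      # SSD-backed, fully indexed, fast queries
    warm_days: int = 30    # cheaper disks, compressed, slower queries
    cold_days: int = 365   # object storage, rehydrate on demand


def tier_for(age_days: int, policy: TierPolicy = TierPolicy()) -> str:
    """Decide which tier a log partition of a given age belongs in."""
    if age_days < policy.hot_days:
        return "hot"
    if age_days < policy.warm_days:
        return "warm"
    if age_days < policy.cold_days:
        return "cold"
    return "delete"  # or downsample to aggregates before deleting
```

A background job would run this over all partitions daily, moving data between tiers and dropping (or downsampling) anything past retention.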
4. Alerting on Log Patterns
Real-time alerting requires continuous evaluation of streaming data without adding latency to the indexing path. Interviewers look for separation of concerns.
Hints to consider:
- Build a separate stream processing path that evaluates alert rules on incoming logs before or during indexing
- Implement stateful windowing to detect rate-based conditions (e.g., error rate exceeds 5% in last 5 minutes)
- Use deduplication and rate limiting to prevent alert storms during widespread incidents
- Support both simple pattern matching (keyword presence) and statistical alerts (error count exceeds 3 standard deviations)
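The stateful windowing and dedup ideas above can be sketched as a sliding-window error-rate rule with a cooldown to suppress alert storms. The window size, threshold, and cooldown are assumptions; a real system would evaluate these rules in a stream processor, not per-process memory.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over a sliding window exceeds a threshold,
    then stays quiet for a cooldown period to avoid repeated pages."""

    def __init__(self, window_seconds: int = 300, threshold: float = 0.05,
                 cooldown_seconds: int = 600):
        self.window = window_seconds
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.events: deque = deque()  # (timestamp, is_error)
        self.last_fired: float | None = None

    def observe(self, timestamp: float, is_error: bool) -> bool:
        self.events.append((timestamp, is_error))
        # Evict events that fell out of the sliding window.
        cutoff = timestamp - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        rate = errors / len(self.events)
        if rate > self.threshold:
            # Deduplicate: suppress repeat alerts inside the cooldown window.
            if self.last_fired is None or timestamp - self.last_fired >= self.cooldown:
                self.last_fired = timestamp
                return True
        return False
```

Because this path only reads the raw stream, alert evaluation adds no latency to the indexing pipeline, which is exactly the separation of concerns interviewers look for.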