Design a Logger System
System Design · Must
Problem Statement
Design a distributed logging system that receives log messages from thousands of application servers and services, indexes them for fast search, and allows operators to query logs by time range, severity, service name, and keywords. The system must handle tens of millions of log entries per second, make logs searchable within seconds of emission, and retain them for compliance and debugging purposes.
At Amazon, interviewers ask this to evaluate your understanding of high-throughput write pipelines, text indexing at scale, and the balance between ingestion speed, query latency, and storage cost. Think of systems like Splunk, the ELK stack, or CloudWatch Logs. The challenge is building a pipeline that absorbs massive write volumes without dropping data while making that data searchable in near real-time.
Key Requirements
Functional
- Log ingestion -- accept log streams from hundreds of thousands of sources via multiple protocols (HTTP, syslog, agent-based collectors)
- Full-text search -- operators search logs using keywords, field-based filters (service, severity, host), and time range constraints
- Aggregation and analytics -- support count, group-by, and percentile queries for dashboards and trend analysis
- Pattern-based alerting -- operators define rules that trigger alerts when specific log patterns appear or error rates exceed thresholds
Non-Functional
- Scalability -- handle 50+ million log events per second ingestion rate with horizontal scaling
- Reliability -- ensure no log data loss during ingestion spikes or component failures with at-least-once delivery
- Latency -- logs searchable within 10 seconds of emission; query response under 2 seconds at p95
- Consistency -- eventual consistency for search; strict ordering within each log source stream
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Ingestion Pipeline and Backpressure
Log traffic spikes dramatically during deployments, incidents, and batch jobs. Interviewers want to see a pipeline that absorbs 10x normal load gracefully.
Hints to consider:
- Use Kafka as a durable buffer between log producers and indexing consumers, providing ordering per source and replay capability
- Deploy lightweight collection agents that batch and compress logs before sending to reduce network overhead
- Implement per-tenant rate limiting and admission control to prevent noisy neighbors from impacting other teams
- Design a load balancer with health checks that routes traffic away from struggling ingestion nodes
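The per-tenant admission control idea above can be sketched as a token-bucket limiter in front of the ingestion tier. This is a minimal illustration, not a production design; the default rates and the in-memory bucket map are assumptions (a real deployment would share limits across ingestion nodes).

```python
import time


class TokenBucket:
    """Token bucket: tokens refill at a steady rate up to a burst capacity.
    Each admitted log event (or batch) spends tokens."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class AdmissionController:
    """One bucket per tenant so a noisy neighbor only throttles itself.
    Default limits are illustrative placeholders."""

    def __init__(self, default_rate: float = 10_000.0, default_burst: float = 20_000.0):
        self.buckets: dict[str, TokenBucket] = {}
        self.default = (default_rate, default_burst)

    def admit(self, tenant: str, batch_size: int) -> bool:
        bucket = self.buckets.setdefault(tenant, TokenBucket(*self.default))
        return bucket.allow(batch_size)
```

Rejected batches would be answered with a retry-after signal so agents back off rather than drop logs outright.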
2. Indexing Strategy for Fast Full-Text Search
Making unstructured text searchable at petabyte scale is the core technical challenge. Interviewers expect you to discuss inverted indexes, sharding, and the freshness/performance tradeoff.
Hints to consider:
- Partition indexes by time (e.g., hourly or daily) to enable efficient pruning of old data and faster queries on recent logs
- Build inverted indexes mapping terms to log entry IDs with positional information for phrase queries
- Use field extraction at index time for structured fields (severity, service, host) to enable fast filtered queries
- Consider bloom filters to quickly skip index partitions that definitely do not contain search terms
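The partitioning and pruning ideas above can be combined in a toy sketch: an hourly-bucketed inverted index where each partition keeps a cheap term set (standing in for a Bloom filter) that lets queries skip partitions that cannot match. Tokenization and the bucket size are simplifying assumptions.

```python
from collections import defaultdict


class TimePartitionedIndex:
    """Inverted index partitioned by time bucket. A per-partition term set
    plays the role of a Bloom filter for partition pruning."""

    def __init__(self, bucket_seconds: int = 3600):
        self.bucket_seconds = bucket_seconds
        self.partitions: dict[int, dict[str, set]] = {}  # bucket -> term -> log IDs
        self.term_sets: dict[int, set] = {}              # bucket -> terms seen

    def index(self, log_id: int, timestamp: int, message: str) -> None:
        bucket = timestamp // self.bucket_seconds
        inv = self.partitions.setdefault(bucket, defaultdict(set))
        terms = self.term_sets.setdefault(bucket, set())
        # Naive whitespace tokenization; real systems use smarter analyzers.
        for term in message.lower().split():
            inv[term].add(log_id)
            terms.add(term)

    def search(self, term: str, start: int, end: int) -> set:
        term = term.lower()
        results: set = set()
        for bucket in range(start // self.bucket_seconds, end // self.bucket_seconds + 1):
            if term not in self.term_sets.get(bucket, ()):  # prune partition
                continue
            results |= self.partitions[bucket].get(term, set())
        return results
```

Time partitioning also makes retention trivial: expiring an hour of logs is a partition drop, not a mass delete.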
3. Data Retention, Tiering, and Cost
At tens of millions of events per second, daily ingest runs to hundreds of terabytes, and raw storage costs quickly become prohibitive. Interviewers want lifecycle management strategies.
Hints to consider:
- Implement hot-warm-cold tiering: recent logs on fast SSD-backed storage, older data on cheaper storage, archived to object storage
- Apply aggressive compression since logs are write-once and highly compressible (10-20x typical compression ratios)
- Support per-team retention policies and quotas to control costs in multi-tenant deployments
- Use downsampling for very old data: keep only aggregated statistics while dropping raw entries past the retention window
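The hot-warm-cold lifecycle above reduces to a simple age-to-tier mapping applied to each time partition. The thresholds below are illustrative defaults, not prescriptions; real policies would be per-team and quota-aware.

```python
from dataclasses import dataclass


@dataclass
class TierPolicy:
    """Age thresholds (in days) for each storage tier; values are examples."""
    hot_days: int = 3      # SSD-backed, fully indexed, fast queries
    warm_days: int = 30    # cheaper disks, compressed, slower queries
    cold_days: int = 365   # object storage, rehydrate on demand


def tier_for(age_days: int, policy: TierPolicy = TierPolicy()) -> str:
    """Decide which tier a log partition of a given age belongs in."""
    if age_days < policy.hot_days:
        return "hot"
    if age_days < policy.warm_days:
        return "warm"
    if age_days < policy.cold_days:
        return "cold"
    return "delete"  # or downsample to aggregates before deleting
```

A background job would run this over all partitions daily, moving data between tiers and dropping (or downsampling) anything past retention.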
4. Alerting on Log Patterns
Real-time alerting requires continuous evaluation of streaming data without adding latency to the indexing path. Interviewers look for separation of concerns.
Hints to consider:
- Build a separate stream processing path that evaluates alert rules on incoming logs before or during indexing
- Implement stateful windowing to detect rate-based conditions (e.g., error rate exceeds 5% in last 5 minutes)
- Use deduplication and rate limiting to prevent alert storms during widespread incidents
- Support both simple pattern matching (keyword presence) and statistical alerts (error count exceeds 3 standard deviations)
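The stateful windowing and dedup ideas above can be sketched as a sliding-window error-rate rule with a cooldown to suppress alert storms. The window size, threshold, and cooldown are assumptions; a real system would evaluate these rules in a stream processor, not per-process memory.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over a sliding window exceeds a threshold,
    then stays quiet for a cooldown period to avoid repeated pages."""

    def __init__(self, window_seconds: int = 300, threshold: float = 0.05,
                 cooldown_seconds: int = 600):
        self.window = window_seconds
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.events: deque = deque()  # (timestamp, is_error)
        self.last_fired: float | None = None

    def observe(self, timestamp: float, is_error: bool) -> bool:
        self.events.append((timestamp, is_error))
        # Evict events that fell out of the sliding window.
        cutoff = timestamp - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        rate = errors / len(self.events)
        if rate > self.threshold:
            # Deduplicate: suppress repeat alerts inside the cooldown window.
            if self.last_fired is None or timestamp - self.last_fired >= self.cooldown:
                self.last_fired = timestamp
                return True
        return False
```

Because this path only reads the raw stream, alert evaluation adds no latency to the indexing pipeline, which is exactly the separation of concerns interviewers look for.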