Design a Logger System
System Design · Must
Problem Statement
Design a centralized logging system that ingests log messages from thousands of distributed services, deduplicates them within a configurable time window (for example, 10 seconds), and provides search, filtering, and live-tail capabilities. Think of a simplified version of Datadog Logs or the ELK stack.
The primary technical challenge is handling extremely high ingest throughput — potentially millions of log lines per second — while performing time-window deduplication to suppress repeated identical messages that flood the system during incidents. Duplicate suppression must be fast and memory-efficient, since the deduplication window can contain millions of unique message fingerprints.
Your design should also support multi-tenant isolation so that different teams or services see only their own logs, along with query capabilities for searching historical logs by time range, severity, service name, and keywords.
Key Requirements
Functional
- Log ingestion -- Accept structured log entries (timestamp, service, severity, message, metadata) from producers at high throughput.
- Time-window deduplication -- Suppress duplicate log messages within a configurable window (default 10 seconds) based on content fingerprint.
- Search and filter -- Query historical logs by time range, severity level, service name, and free-text keyword search.
- Live tail -- Stream new log entries matching a filter in real time to a developer's terminal or dashboard.
Non-Functional
- Scalability -- Ingest 1 million+ log lines per second across all tenants.
- Latency -- Logs appear in search results within 5 seconds of ingestion; live tail within 2 seconds.
- Durability -- Ingested logs must not be lost, even during consumer failures.
- Multi-tenancy -- Strict isolation between tenants with independent retention policies.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Time-Window Deduplication
Suppressing duplicates within a sliding window at high throughput is the core algorithmic challenge.
Hints to consider:
- Hash each log message to a fingerprint and track seen fingerprints in a Redis key with TTL equal to the dedup window
- How do you define "duplicate" — exact message match, or do you normalize timestamps and request IDs first?
- Consider the memory cost of storing millions of fingerprints and whether a probabilistic structure like a Bloom filter is acceptable
- What happens at the boundary of the time window — can a message slip through as both deduplicated and new?
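The fingerprint-with-TTL idea above can be sketched in a few lines. This is a minimal in-memory stand-in for the Redis approach (`SET key NX EX <window>` would give you the same check-and-set with automatic expiry); the normalization regexes are assumptions about what "duplicate" means, stripping timestamps and request IDs so that repeated errors with different IDs still collapse together:

```python
import hashlib
import re
import time

# Hypothetical normalization rules -- what counts as a "duplicate" is a design
# decision. Here we strip ISO timestamps and request/trace IDs before hashing.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?Z?")
REQUEST_ID_RE = re.compile(r"\b(req|request_id|trace)[=:]\S+", re.IGNORECASE)

def fingerprint(message: str) -> str:
    normalized = TIMESTAMP_RE.sub("<ts>", message)
    normalized = REQUEST_ID_RE.sub("<id>", normalized)
    return hashlib.sha256(normalized.encode()).hexdigest()

class WindowDeduper:
    """In-memory stand-in for a Redis key with TTL equal to the dedup window."""

    def __init__(self, window_seconds: float = 10.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self._seen: dict[str, float] = {}  # fingerprint -> expiry time

    def should_emit(self, message: str) -> bool:
        now = self.clock()
        # Lazily evict expired fingerprints; Redis TTLs would do this for us,
        # and at real scale this dict is exactly the memory cost the hints
        # warn about (a Bloom filter trades false suppression for less memory).
        self._seen = {fp: exp for fp, exp in self._seen.items() if exp > now}
        fp = fingerprint(message)
        if fp in self._seen:
            return False  # duplicate inside the window: suppress
        self._seen[fp] = now + self.window
        return True
```

Note the boundary behavior this makes concrete: a message arriving exactly as its fingerprint expires is treated as new, so a log line repeating every 10.1 seconds is never suppressed. Whether that is acceptable is worth raising explicitly in the interview.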
2. High-Throughput Ingestion Pipeline
Moving millions of log lines per second from producers to storage without dropping messages requires careful pipeline design.
Hints to consider:
- Use Kafka as a durable buffer between producers and consumers to absorb traffic spikes
- How do you partition Kafka topics — by tenant, by service, or by hash of the log fingerprint?
- Consider the tradeoff between batching logs for throughput and flushing frequently for low latency
- What backpressure mechanism prevents producers from overwhelming the pipeline?
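The batching-versus-latency tradeoff and the partitioning question can both be made concrete with a small sketch. This is not a Kafka client; `sink` stands in for a producer's batch send, and the flush thresholds (500 lines or 200 ms, whichever comes first) are illustrative assumptions:

```python
import time
import zlib

def partition_for(tenant: str, num_partitions: int) -> int:
    # Keying by tenant keeps each tenant's logs ordered within one partition,
    # at the cost of hot partitions for very chatty tenants. CRC32 is used
    # here only as a stable, deterministic hash for the sketch.
    return zlib.crc32(tenant.encode()) % num_partitions

class LogBatcher:
    """Buffers log lines and flushes on size OR age -- the classic
    throughput-vs-latency knob from the hints above."""

    def __init__(self, sink, max_batch=500, max_delay_s=0.2, clock=time.monotonic):
        self.sink = sink            # e.g. a Kafka producer send (assumption)
        self.max_batch = max_batch
        self.max_delay_s = max_delay_s
        self.clock = clock
        self._buf: list[str] = []
        self._oldest: float | None = None

    def add(self, line: str) -> None:
        if self._oldest is None:
            self._oldest = self.clock()
        self._buf.append(line)
        if len(self._buf) >= self.max_batch:
            self.flush()  # size-triggered flush: throughput path

    def poll(self) -> None:
        # Call periodically; flushes a partial batch once it is too old,
        # bounding the latency a half-full buffer can add.
        if self._buf and self.clock() - self._oldest >= self.max_delay_s:
            self.flush()

    def flush(self) -> None:
        if self._buf:
            self.sink(self._buf)
            self._buf = []
            self._oldest = None
```

Backpressure is the piece this sketch omits: in a real pipeline, `add` would block (or shed load) when the downstream broker rejects writes, rather than letting the buffer grow without bound.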
3. Search and Indexing
Querying billions of log entries efficiently requires the right storage and indexing strategy.
Hints to consider:
- Elasticsearch is a natural fit — how do you design the index strategy (time-based indices, one per tenant)?
- Consider using index lifecycle management to roll over and eventually delete old indices based on retention policy
- How do you handle free-text search performance when log messages contain highly variable content?
- What is the tradeoff between indexing every field and storage cost?
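The time-based, per-tenant index strategy from the first hint reduces retention enforcement to dropping whole indices, which is what makes it cheap. A minimal sketch, assuming a `logs-<tenant>-<yyyy.MM.dd>` naming pattern (the pattern itself is an assumption, not a fixed Elasticsearch convention), with retention modeled as the delete phase of an ILM policy:

```python
from datetime import datetime, timedelta, timezone

def index_name(tenant: str, ts: datetime) -> str:
    # One index per tenant per day: tenant isolation falls out of the name,
    # and a time-range query only has to touch the matching daily indices.
    return f"logs-{tenant}-{ts:%Y.%m.%d}"

def expired_indices(indices: list[str], retention_days: int,
                    now: datetime) -> list[str]:
    # Stand-in for an ILM delete phase: any index whose date component is
    # older than the retention window is deleted wholesale -- no per-document
    # deletes, which would be far more expensive at this volume.
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in indices:
        date_part = name.rsplit("-", 1)[1]
        day = datetime.strptime(date_part, "%Y.%m.%d").replace(tzinfo=timezone.utc)
        if day < cutoff:
            expired.append(name)
    return expired
```

Per-tenant indices also let each tenant carry its own `retention_days`, satisfying the independent-retention requirement; the tradeoff is index-count explosion with many small tenants, which is where rollover by size rather than strictly by day becomes worth discussing.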