Design a Distributed Log Aggregation System
System Design · Must
Problem Statement
Design a distributed log aggregation system that collects, processes, and stores logs from 50,000 application servers generating a combined 10 TB of logs per day. The system must handle log ingestion at peak rates of 5 million log lines per second, support real-time search and filtering, and retain logs for 90 days. Your design should accommodate machine failures, network partitions, and variable log volumes while maintaining low latency for both writes and queries. Rather than leaning on an off-the-shelf centralized logging solution, you should explain the full data pipeline from collection through storage and retrieval, including how to handle schema evolution, multi-tenancy, and cost-effective long-term retention.
This problem tests your ability to design a high-throughput, fault-tolerant data pipeline with competing requirements around write performance, query flexibility, and storage efficiency. Interviewers want to see how you balance trade-offs between real-time processing and batch optimization, handle backpressure when downstream systems lag, and design indexes that support diverse query patterns without exploding storage costs.
Key Requirements
Functional
- Log Collection -- agents on application servers must capture structured and unstructured logs and forward them reliably
- Real-Time Ingestion -- the system must accept millions of log entries per second with minimal back-pressure to applications
- Search and Filter -- users must query logs by timestamp, severity, service name, host, and arbitrary text fields within seconds
- Alerting -- users must define threshold-based alerts that trigger when specific log patterns appear
- Retention and Archival -- hot storage for the most recent 7 days, warm storage through day 30, and cold archival through day 90, with cost-optimized tiering
Non-Functional
- Scalability -- must handle 5M log lines/sec peak and scale horizontally as application fleet grows
- Reliability -- tolerate individual node failures and network partitions without data loss; at-least-once delivery guarantees
- Latency -- end-to-end ingestion latency under 10 seconds; query response time under 5 seconds for typical searches
- Consistency -- eventual consistency is acceptable; recent logs may take seconds to appear in search results
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Log Collection and Buffering Strategy
Interviewers want to understand how you reliably collect logs from thousands of application servers without overwhelming the network or losing data during downstream outages. This reveals your grasp of buffering, batching, and back-pressure handling.
Hints to consider:
- Deploy lightweight agents on each application server that tail log files or capture stdout and buffer logs locally on disk before forwarding
- Use batching and compression to reduce network overhead -- send logs in batches of 100-1000 entries or every few seconds
- Implement exponential back-off and local disk buffering when downstream ingestion tier is unavailable or slow
- Consider push vs. pull models -- agents pushing to load balancers vs. aggregators pulling from agent endpoints
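The batching, back-off, and local-spooling hints above can be combined into a single agent-side forwarder. The sketch below is illustrative, not a reference implementation: send_batch, spool_dir, and the retry limits are all assumed interfaces and tunables.

```python
import json
import os
import random
import time


class LogForwarder:
    """Agent-side forwarding sketch: batch log entries, retry with
    jittered exponential back-off, and spill to a local disk buffer
    when the downstream ingestion tier stays unavailable. Names like
    send_batch and spool_dir are illustrative assumptions."""

    def __init__(self, send_batch, spool_dir, max_batch=500,
                 flush_interval=2.0, base_backoff=0.5, max_retries=5):
        self.send_batch = send_batch      # callable(list[dict]); raises on failure
        self.spool_dir = spool_dir
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.base_backoff = base_backoff
        self.max_retries = max_retries
        self.buffer = []
        self.last_flush = time.monotonic()

    def append(self, entry):
        self.buffer.append(entry)
        # Flush on batch size or batch age, whichever is hit first.
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.last_flush = time.monotonic()
        backoff = self.base_backoff
        for _ in range(self.max_retries):
            try:
                self.send_batch(batch)
                return
            except Exception:
                time.sleep(backoff * (1 + random.random()))  # jittered back-off
                backoff = min(backoff * 2, 30.0)
        # Ingestion tier still down: keep the batch on local disk so a
        # background replayer can resend it later instead of dropping it.
        os.makedirs(self.spool_dir, exist_ok=True)
        path = os.path.join(self.spool_dir, f"{time.time_ns()}.jsonl")
        with open(path, "w") as f:
            for entry in batch:
                f.write(json.dumps(entry) + "\n")
```

In an interview you would also mention that the disk buffer is bounded, since an agent with finite disk cannot spool forever during a long outage.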
2. Partitioning and Sharding for Write Throughput
At 5M log lines per second, a single ingestion node cannot keep up. Interviewers assess how you partition the write load and ensure even distribution while maintaining queryability.
Hints to consider:
- Partition logs by a consistent hashing scheme over service name or tenant ID to distribute load across ingestion nodes
- Use a message queue or distributed log (Kafka-style) as a buffer between collection and processing to decouple components and enable horizontal scaling
- Shard storage by time ranges (daily or hourly partitions) so queries can prune partitions and writes append to current partition
- Avoid hot partitions by mixing temporal and dimensional sharding -- hash by service within time buckets
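The last hint, mixing temporal and dimensional sharding, can be sketched as a partition-key function. The hourly bucket and the shard count here are illustrative choices, not prescribed values:

```python
import hashlib


def partition_key(service: str, ts: float, shards_per_bucket: int = 64) -> str:
    """Sketch of mixed temporal + dimensional sharding: an hourly time
    bucket keeps queries prunable by time range, while a hash of the
    service name spreads concurrent writes across shards_per_bucket
    shards so no single hot partition absorbs the whole fleet."""
    hour_bucket = int(ts // 3600)  # hourly partition for time-range pruning
    digest = hashlib.sha1(service.encode()).digest()
    shard = int.from_bytes(digest[:4], "big") % shards_per_bucket
    return f"{hour_bucket}-{shard:03d}"
```

A query scoped to one service and one hour touches exactly one shard, while a fleet-wide query fans out over at most shards_per_bucket shards per hour.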
3. Indexing for Fast Query Performance
Users need to search across billions of log entries by timestamp, text content, and structured fields. Interviewers look for your understanding of inverted indexes, column stores, and trade-offs between index size and query speed.
Hints to consider:
- Build inverted indexes on high-cardinality fields like message text, service name, and host to support full-text and filter queries
- Use columnar storage formats (Parquet, ORC) for time-series data to enable efficient scanning and compression
- Create secondary indexes on commonly filtered fields (severity, tenant ID) but avoid indexing every field to control storage overhead
- Implement approximate filters (Bloom filters) to skip partitions that definitely don't contain matching logs
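The Bloom-filter hint deserves a concrete picture: each partition keeps a small filter over the terms it contains, and a query skips any partition whose filter says "definitely absent." A minimal sketch (sizes and hash scheme are illustrative):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch for partition pruning. False positives
    are possible (a skippable partition may still be scanned), but false
    negatives are not, so no matching log is ever missed."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, term: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, term: str):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term: str) -> bool:
        # False => term is definitely not in this partition: safe to skip.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(term))
```

At query time the engine checks might_contain for each candidate partition before paying the cost of opening and scanning its files.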
4. Tiered Storage and Cost Optimization
Storing 10 TB/day for 90 days costs millions annually if kept on fast SSDs. Interviewers want to see you design a tiered storage strategy that balances access speed with cost.
Hints to consider:
- Keep the most recent 7 days on SSD-backed hot storage with full indexes for low-latency queries
- Move logs that are 8-30 days old to warm storage on slower disks with reduced indexing or compressed columnar formats
- Archive logs older than 30 days to object storage (S3-like) in compressed batches with minimal indexing; accept slower query times
- Implement lifecycle policies that automatically transition data between tiers and delete logs after 90 days
5. Handling Failures and Exactly-Once Semantics
Log data is valuable for debugging and compliance, so losing logs is unacceptable. Interviewers probe how you ensure durability and idempotency across the pipeline.
Hints to consider:
- Use at-least-once delivery with idempotent processing -- assign unique IDs to log batches and deduplicate on ingestion
- Replicate data across multiple availability zones or data centers to tolerate node and network failures
- Checkpoint processing offsets in the message queue so failed processors can resume without reprocessing or skipping logs
- Implement write-ahead logs or durable queues at each stage so no data is lost even if a component crashes mid-processing
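The first and third hints combine into an ingestion step that is at-least-once on delivery but effectively exactly-once in effect: dedupe by batch ID, write durably, then advance the checkpoint. The in-memory set and list below are stand-ins for real backends (e.g. a TTL'd key-value store and the log store):

```python
class IdempotentIngester:
    """Sketch of idempotent ingestion over an at-least-once pipeline:
    each batch carries a unique ID, duplicates are skipped, and the
    consumer offset is checkpointed only after the durable write."""

    def __init__(self):
        self.seen_batch_ids = set()  # in production: a TTL'd dedup store
        self.committed_offset = -1   # in production: queue offset commit
        self.storage = []            # stand-in for the durable log store

    def process(self, batch_id: str, offset: int, entries: list) -> bool:
        if batch_id in self.seen_batch_ids:
            # Redelivery after a retry: advance the checkpoint anyway so
            # the duplicate batch is not redelivered forever.
            self.committed_offset = max(self.committed_offset, offset)
            return False
        self.storage.extend(entries)           # durable write first...
        self.seen_batch_ids.add(batch_id)
        self.committed_offset = max(self.committed_offset, offset)  # ...then checkpoint
        return True
```

Ordering matters: checkpointing before the write risks losing a batch on crash, while writing before the dedup check risks duplicates, which is exactly the trade-off interviewers probe here.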
Suggested Approach
Step 1: Clarify Requirements
Start by confirming the scale, access patterns, and trade-offs with your interviewer. Ask:
- What is the expected log volume per server and peak traffic patterns? Are there traffic spikes during deployments or incidents?
- What query patterns do users need -- recent logs for a specific service, full-text search, aggregations over time windows?
- What are the acceptable latencies for ingestion and queries? Can queries take 10-30 seconds for historical data?
- Do logs have a consistent schema, or must the system handle arbitrary JSON and unstructured text?
- What compliance or retention policies apply? Do logs need to be immutable or encrypted?
Clarify whether you should focus on the ingestion pipeline, query engine, or both, and whether the system needs to support alerts and dashboards.