Design a Distributed Log Aggregation System
System Design · Must
Problem Statement
Design a distributed log aggregation system that collects, processes, and stores logs from 50,000 application servers generating a combined 10 TB of logs per day. The system must handle log ingestion at peak rates of 5 million log lines per second, support real-time search and filtering, and retain logs for 90 days. Your design should accommodate machine failures, network partitions, and variable log volumes while maintaining low latency for both writes and queries. Rather than leaning on an off-the-shelf centralized logging solution, you should explain the full data pipeline from collection through storage and retrieval, including how to handle schema evolution, multi-tenancy, and cost-effective long-term retention.
This problem tests your ability to design a high-throughput, fault-tolerant data pipeline with competing requirements around write performance, query flexibility, and storage efficiency. Interviewers want to see how you balance trade-offs between real-time processing and batch optimization, handle backpressure when downstream systems lag, and design indexes that support diverse query patterns without exploding storage costs.
Key Requirements
Functional
- Log Collection -- agents on application servers must capture structured and unstructured logs and forward them reliably
- Real-Time Ingestion -- the system must accept millions of log entries per second with minimal back-pressure to applications
- Search and Filter -- users must query logs by timestamp, severity, service name, host, and arbitrary text fields within seconds
- Alerting -- users must define threshold-based alerts that trigger when specific log patterns appear
- Retention and Archival -- hot storage for the most recent 7 days, warm storage through day 30, and cold archival through day 90, with cost-optimized tiering
Non-Functional
- Scalability -- must handle 5M log lines/sec peak and scale horizontally as application fleet grows
- Reliability -- tolerate individual node failures and network partitions without data loss; at-least-once delivery guarantees
- Latency -- end-to-end ingestion latency under 10 seconds; query response time under 5 seconds for typical searches
- Consistency -- eventual consistency is acceptable; recent logs may take seconds to appear in search results
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Log Collection and Buffering Strategy
Interviewers want to understand how you reliably collect logs from thousands of application servers without overwhelming the network or losing data during downstream outages. This reveals your grasp of buffering, batching, and back-pressure handling.
Hints to consider:
- Deploy lightweight agents on each application server that tail log files or capture stdout and buffer logs locally on disk before forwarding
- Use batching and compression to reduce network overhead -- send logs in batches of 100-1000 entries or every few seconds
- Implement exponential back-off and local disk buffering when downstream ingestion tier is unavailable or slow
- Consider push vs. pull models -- agents pushing to load balancers vs. aggregators pulling from agent endpoints
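The batching, back-off, and local-spooling hints above can be combined into a single agent-side forwarder. The sketch below is illustrative, not a reference implementation: send_batch, spool_dir, and the retry limits are all assumed interfaces and tunables.

```python
import json
import os
import random
import time


class LogForwarder:
    """Agent-side forwarding sketch: batch log entries, retry with
    jittered exponential back-off, and spill to a local disk buffer
    when the downstream ingestion tier stays unavailable. Names like
    send_batch and spool_dir are illustrative assumptions."""

    def __init__(self, send_batch, spool_dir, max_batch=500,
                 flush_interval=2.0, base_backoff=0.5, max_retries=5):
        self.send_batch = send_batch      # callable(list[dict]); raises on failure
        self.spool_dir = spool_dir
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.base_backoff = base_backoff
        self.max_retries = max_retries
        self.buffer = []
        self.last_flush = time.monotonic()

    def append(self, entry):
        self.buffer.append(entry)
        # Flush on batch size or batch age, whichever is hit first.
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.last_flush = time.monotonic()
        backoff = self.base_backoff
        for _ in range(self.max_retries):
            try:
                self.send_batch(batch)
                return
            except Exception:
                time.sleep(backoff * (1 + random.random()))  # jittered back-off
                backoff = min(backoff * 2, 30.0)
        # Ingestion tier still down: keep the batch on local disk so a
        # background replayer can resend it later instead of dropping it.
        os.makedirs(self.spool_dir, exist_ok=True)
        path = os.path.join(self.spool_dir, f"{time.time_ns()}.jsonl")
        with open(path, "w") as f:
            for entry in batch:
                f.write(json.dumps(entry) + "\n")
```

In an interview you would also mention that the disk buffer is bounded, since an agent with finite disk cannot spool forever during a long outage.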
2. Partitioning and Sharding for Write Throughput
At 5M log lines per second, a single ingestion node cannot keep up. Interviewers assess how you partition the write load and ensure even distribution while maintaining queryability.
Hints to consider:
- Partition logs by a consistent hashing scheme over service name or tenant ID to distribute load across ingestion nodes
- Use a message queue or distributed log (Kafka-style) as a buffer between collection and processing to decouple components and enable horizontal scaling
- Shard storage by time ranges (daily or hourly partitions) so queries can prune partitions and writes append to current partition
- Avoid hot partitions by mixing temporal and dimensional sharding -- hash by service within time buckets
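The last hint, mixing temporal and dimensional sharding, can be sketched as a partition-key function. The hourly bucket and the shard count here are illustrative choices, not prescribed values:

```python
import hashlib


def partition_key(service: str, ts: float, shards_per_bucket: int = 64) -> str:
    """Sketch of mixed temporal + dimensional sharding: an hourly time
    bucket keeps queries prunable by time range, while a hash of the
    service name spreads concurrent writes across shards_per_bucket
    shards so no single hot partition absorbs the whole fleet."""
    hour_bucket = int(ts // 3600)  # hourly partition for time-range pruning
    digest = hashlib.sha1(service.encode()).digest()
    shard = int.from_bytes(digest[:4], "big") % shards_per_bucket
    return f"{hour_bucket}-{shard:03d}"
```

A query scoped to one service and one hour touches exactly one shard, while a fleet-wide query fans out over at most shards_per_bucket shards per hour.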
3. Indexing for Fast Query Performance
Users need to search across billions of log entries by timestamp, text content, and structured fields. Interviewers look for your understanding of inverted indexes, column stores, and trade-offs between index size and query speed.
Hints to consider:
- Build inverted indexes on high-cardinality fields like message text, service name, and host to support full-text and filter queries
- Use columnar storage formats (Parquet, ORC) for time-series data to enable efficient scanning and compression
- Create secondary indexes on commonly filtered fields (severity, tenant ID) but avoid indexing every field to control storage overhead
- Implement approximate filters (Bloom filters) to skip partitions that definitely don't contain matching logs
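The Bloom-filter hint deserves a concrete picture: each partition keeps a small filter over the terms it contains, and a query skips any partition whose filter says "definitely absent." A minimal sketch (sizes and hash scheme are illustrative):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch for partition pruning. False positives
    are possible (a skippable partition may still be scanned), but false
    negatives are not, so no matching log is ever missed."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, term: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, term: str):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term: str) -> bool:
        # False => term is definitely not in this partition: safe to skip.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(term))
```

At query time the engine checks might_contain for each candidate partition before paying the cost of opening and scanning its files.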
4. Tiered Storage and Cost Optimization
Storing 10 TB/day for 90 days costs millions annually if kept on fast SSDs. Interviewers want to see you design a tiered storage strategy that balances access speed with cost.
Hints to consider:
- Keep the most recent 7 days on SSD-backed hot storage with full indexes for low-latency queries
- Move logs that are 8-30 days old to warm storage on slower disks with reduced indexing or compressed columnar formats
- Archive logs older than 30 days to object storage (S3-like) in compressed batches with minimal indexing; accept slower query times
- Implement lifecycle policies that automatically transition data between tiers and delete logs after 90 days
5. Handling Failures and Exactly-Once Semantics
Log data is valuable for debugging and compliance, so losing logs is unacceptable. Interviewers probe how you ensure durability and idempotency across the pipeline.
Hints to consider:
- Use at-least-once delivery with idempotent processing -- assign unique IDs to log batches and deduplicate on ingestion
- Replicate data across multiple availability zones or data centers to tolerate node and network failures
- Checkpoint processing offsets in the message queue so failed processors can resume without reprocessing or skipping logs
- Implement write-ahead logs or durable queues at each stage so no data is lost even if a component crashes mid-processing
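The first and third hints combine into an ingestion step that is at-least-once on delivery but effectively exactly-once in effect: dedupe by batch ID, write durably, then advance the checkpoint. The in-memory set and list below are stand-ins for real backends (e.g. a TTL'd key-value store and the log store):

```python
class IdempotentIngester:
    """Sketch of idempotent ingestion over an at-least-once pipeline:
    each batch carries a unique ID, duplicates are skipped, and the
    consumer offset is checkpointed only after the durable write."""

    def __init__(self):
        self.seen_batch_ids = set()  # in production: a TTL'd dedup store
        self.committed_offset = -1   # in production: queue offset commit
        self.storage = []            # stand-in for the durable log store

    def process(self, batch_id: str, offset: int, entries: list) -> bool:
        if batch_id in self.seen_batch_ids:
            # Redelivery after a retry: advance the checkpoint anyway so
            # the duplicate batch is not redelivered forever.
            self.committed_offset = max(self.committed_offset, offset)
            return False
        self.storage.extend(entries)           # durable write first...
        self.seen_batch_ids.add(batch_id)
        self.committed_offset = max(self.committed_offset, offset)  # ...then checkpoint
        return True
```

Ordering matters: checkpointing before the write risks losing a batch on crash, while writing before the dedup check risks duplicates, which is exactly the trade-off interviewers probe here.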
Suggested Approach
Step 1: Clarify Requirements
Start by confirming the scale, access patterns, and trade-offs with your interviewer. Ask:
- What is the expected log volume per server and peak traffic patterns? Are there traffic spikes during deployments or incidents?
- What query patterns do users need -- recent logs for a specific service, full-text search, aggregations over time windows?
- What are the acceptable latencies for ingestion and queries? Can queries take 10-30 seconds for historical data?
- Do logs have a consistent schema, or must the system handle arbitrary JSON and unstructured text?
- What compliance or retention policies apply? Do logs need to be immutable or encrypted?
Clarify whether you should focus on the ingestion pipeline, query engine, or both, and whether the system needs to support alerts and dashboards.