Design Wikipedia Crawler
System Design · Must
Problem Statement
Design a distributed web crawler that can systematically discover, fetch, and store all pages on Wikipedia at scale. The system must handle tens of millions of pages, respect crawl rate limits to avoid overloading Wikipedia's servers, detect and process page updates efficiently, and store the crawled content in a searchable format for downstream consumers such as search indexes or analytics pipelines.
This problem tests your understanding of distributed systems coordination, politeness policies, URL frontier management, and fault tolerance in long-running batch processes. At Lyft, interviewers use this question to evaluate how you decompose a large-scale data ingestion problem into parallelizable components, handle deduplication of URLs, and design for graceful recovery when individual workers fail mid-crawl.
The key tension is between crawl speed and politeness. Wikipedia has millions of pages that change at different rates, and your system needs to prioritize fresh content while respecting robots.txt directives and rate limits. You also need to handle edge cases like redirect chains, duplicate content across URLs, and pages that change frequently versus those that are essentially static.
Key Requirements
Functional
- URL discovery and frontier management -- Maintain a prioritized queue of URLs to crawl, seeded from Wikipedia's sitemap and expanded by extracting links from fetched pages
- Content fetching and parsing -- Download HTML pages, extract text content and metadata, and identify outgoing links for further crawling
- Deduplication -- Detect and skip duplicate URLs (including canonicalization of different URL formats pointing to the same page) and duplicate content
- Incremental re-crawling -- Track when pages were last crawled and prioritize re-fetching pages that are likely to have changed based on historical update frequency
Non-Functional
- Scalability -- Crawl 50 million pages within 24 hours using a horizontally scalable worker fleet
- Politeness -- Enforce per-domain rate limits (no more than 1 request per second to any single host) and respect robots.txt directives
- Reliability -- Tolerate worker failures without losing progress; resume interrupted crawls from the last checkpoint
- Storage efficiency -- Compress and deduplicate stored content to minimize storage costs while maintaining fast retrieval
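The per-host politeness requirement above is simple enough to sketch directly. The following is a minimal illustration (class and parameter names are my own, not from the problem statement) that enforces a minimum spacing between requests to the same host; the 1.0-second default mirrors the stated limit of one request per second per host.

```python
import time
from collections import defaultdict

class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self.last_request = defaultdict(float)  # host -> timestamp of last request

    def wait_time(self, host: str, now: float) -> float:
        """Seconds a worker must still wait before it may hit `host`."""
        elapsed = now - self.last_request[host]
        return max(0.0, self.min_interval_s - elapsed)

    def acquire(self, host: str) -> None:
        """Block until a request to `host` is allowed, then record it."""
        delay = self.wait_time(host, time.monotonic())
        if delay > 0:
            time.sleep(delay)
        self.last_request[host] = time.monotonic()
```

In a real deployment this state would live wherever the per-host queues live (see the frontier discussion below), so that one worker's requests are visible to the limiter for all of that host's traffic.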
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. URL Frontier Design and Prioritization
The URL frontier is the heart of any web crawler. Interviewers want to see how you balance breadth-first discovery with priority-based scheduling, and how you prevent the frontier from growing without bound or becoming a bottleneck.
Hints to consider:
- Use a two-level queue: a priority queue that selects which URLs to crawl next based on importance, and per-host queues that enforce politeness delays
- Implement priority scoring based on page rank, last-modified headers, historical change frequency, and depth from seed URLs
- Shard the frontier across multiple nodes by hashing the domain name to ensure per-host politeness is enforced locally
- Use a Bloom filter or similar probabilistic data structure to cheaply check whether a URL has already been enqueued
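The two-level structure from the hints above can be sketched in a few dozen lines. This is an in-memory toy, not a distributed frontier: the front level is a heap of hosts ordered by when they next become eligible (which enforces the politeness delay), the back level is one FIFO queue per host, and a plain `set` stands in for the Bloom filter.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

class TwoLevelFrontier:
    """Front: heap of (next_allowed_time, -priority, host).
    Back: per-host FIFO queues. `seen` stands in for a Bloom filter."""

    def __init__(self, politeness_delay_s: float = 1.0):
        self.delay = politeness_delay_s
        self.host_queues = defaultdict(deque)  # host -> (priority, url) entries
        self.ready_heap = []                   # (next_allowed_time, -priority, host)
        self.scheduled_hosts = set()
        self.seen = set()                      # stand-in for a Bloom filter

    def enqueue(self, url: str, priority: float = 0.0) -> bool:
        if url in self.seen:                   # cheap already-enqueued check
            return False
        self.seen.add(url)
        host = urlparse(url).netloc
        self.host_queues[host].append((priority, url))
        if host not in self.scheduled_hosts:
            heapq.heappush(self.ready_heap, (0.0, -priority, host))
            self.scheduled_hosts.add(host)
        return True

    def next_url(self, now: float):
        """Return a URL whose host is past its politeness delay, or None."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, _, host = heapq.heappop(self.ready_heap)
        _, url = self.host_queues[host].popleft()
        if self.host_queues[host]:             # reschedule host after the delay
            next_prio = self.host_queues[host][0][0]
            heapq.heappush(self.ready_heap, (now + self.delay, -next_prio, host))
        else:
            self.scheduled_hosts.remove(host)
        return url
```

A production scoring function would fold in page rank, `Last-Modified` headers, and historical change frequency as the hints describe; here priority is just a number passed by the caller.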
2. Distributed Coordination and Partition Strategy
With hundreds of workers crawling concurrently, you need a strategy to divide work without conflicts, minimize coordination overhead, and ensure no pages are missed or crawled redundantly.
Hints to consider:
- Partition URLs by domain hash so each worker is responsible for a set of domains, which naturally enforces per-domain rate limits
- Use a central coordinator (or consistent hashing ring) to assign domain partitions to workers and rebalance when workers join or leave
- Workers pull batches of URLs from their assigned partition rather than individual URLs to reduce coordination frequency
- Implement heartbeat-based failure detection so abandoned partitions are reassigned within seconds
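The consistent-hashing option from the hints above can be sketched as follows. This is an illustrative toy (worker names and the virtual-node count are made up): each worker owns many points on a hash ring, a domain is assigned to the first worker point clockwise from its hash, and removing a worker only reassigns the domains that worker owned.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map domains to workers; virtual nodes smooth out the load."""

    def __init__(self, workers=(), vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker)
        for w in workers:
            self.add_worker(w)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def add_worker(self, worker: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{worker}#{i}"), worker))

    def remove_worker(self, worker: str) -> None:
        self.ring = [(h, w) for h, w in self.ring if w != worker]

    def owner(self, domain: str) -> str:
        """Worker responsible for all URLs under `domain`."""
        h = self._hash(domain)
        idx = bisect.bisect(self.ring, (h, ""))
        if idx == len(self.ring):  # wrap around the ring
            idx = 0
        return self.ring[idx][1]
```

The useful property for the interview discussion: when a worker leaves (or a new one joins), only the partitions adjacent to its points on the ring move, so rebalancing does not reshuffle the whole frontier.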
3. Deduplication at Scale
Wikipedia has many URLs that resolve to the same content (redirects, language variants, URL parameters). Interviewers expect you to handle both URL-level and content-level deduplication efficiently.
Hints to consider:
- Canonicalize URLs by normalizing scheme, removing trailing slashes, sorting query parameters, and following redirects to their final destination
- Use content fingerprinting (SimHash or MinHash) to detect near-duplicate pages that differ only in boilerplate or timestamps
- Store URL-to-content-hash mappings in a distributed key-value store like Redis or DynamoDB for fast lookup during crawl
- Maintain a Bloom filter with billions of entries for fast negative lookups on previously seen URLs
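The canonicalization step in the first hint is concrete enough to sketch. A minimal version, assuming we lowercase scheme and host, drop default ports and fragments, strip the trailing slash, and sort query parameters; redirect resolution needs a network round trip and is omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Normalize a URL so different spellings of the same page collide."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop default ports (:80 for http, :443 for https)
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = path.rstrip("/") or "/"                    # no trailing slash
    query = urlencode(sorted(parse_qsl(query)))       # stable parameter order
    return urlunsplit((scheme, netloc, path, query, ""))
```

The canonical form, not the raw URL, is what should be hashed into the Bloom filter and used as the key in the URL-to-content-hash store, so that spelling variants of the same page are caught before a fetch is wasted.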
4. Fault Tolerance and Checkpointing
A full crawl of Wikipedia takes hours or days. Workers will crash, network requests will fail, and the system must continue making progress without re-crawling everything from scratch.
Hints to consider:
- Persist the frontier state periodically to durable storage so a crashed coordinator can resume from the last checkpoint
- Use visibility timeouts on frontier entries: if a worker claims a URL batch but fails to report completion, the batch becomes available again
- Store crawl results incrementally as each page is fetched rather than batching writes, so completed work survives worker crashes
- Implement exponential backoff with jitter for transient HTTP errors (429, 503) and permanent failure detection for consistently failing URLs
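The visibility-timeout idea above is the crux of crash recovery, and a small in-memory sketch makes the mechanics concrete (in production this role is typically played by a queue service or the frontier store itself; the class and timeout value here are illustrative). Claiming a batch hides it for a lease period; if the worker never acknowledges completion, the batch silently becomes claimable again.

```python
class LeaseQueue:
    """Frontier batches with visibility timeouts for crash recovery."""

    def __init__(self, visibility_s: float = 30.0):
        self.visibility_s = visibility_s
        self.available = []   # list of (batch_id, urls)
        self.in_flight = {}   # batch_id -> (lease deadline, urls)

    def put(self, batch_id: str, urls: list) -> None:
        self.available.append((batch_id, urls))

    def claim(self, now: float):
        """Hand out one batch, reviving any expired leases first."""
        for bid, (deadline, urls) in list(self.in_flight.items()):
            if deadline <= now:              # worker died or stalled: revive
                del self.in_flight[bid]
                self.available.append((bid, urls))
        if not self.available:
            return None
        bid, urls = self.available.pop(0)
        self.in_flight[bid] = (now + self.visibility_s, urls)
        return bid, urls

    def ack(self, batch_id: str) -> None:
        """Worker reports completion; the batch is retired for good."""
        self.in_flight.pop(batch_id, None)
```

Because revived batches may be re-fetched by a second worker, downstream writes must be idempotent (keyed by canonical URL), which pairs naturally with the incremental-write hint above.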