Design Wikipedia Crawler
Problem Statement
You are designing a distributed web crawler that systematically downloads and indexes all pages on Wikipedia. The crawler must traverse the link graph starting from a set of seed URLs, fetch page content, extract outgoing links, and continue the process until all reachable pages have been visited. Wikipedia contains over 60 million articles across 300+ language editions, with pages being added and modified continuously, so the crawler must also support periodic re-crawling to keep its index fresh.
The core challenges revolve around coordination and efficiency. Multiple crawler workers must operate in parallel without duplicating effort -- two workers should never fetch the same URL simultaneously or redundantly. The crawler must respect rate limits to avoid overwhelming Wikipedia's servers, handle transient failures gracefully (timeouts, 5xx errors, network partitions), and detect when a page has not changed since the last crawl to skip unnecessary processing. The system should prioritize frequently updated pages and high-traffic articles over rarely changed backwaters.
Scaling the crawler across multiple machines introduces further complexity: distributing URLs to workers, maintaining a shared frontier of URLs to visit, tracking which pages have already been crawled, and ensuring the system makes forward progress even when individual workers fail.
Key Requirements
Functional
- URL discovery and traversal -- Starting from seed URLs, the crawler extracts links from fetched pages and adds unseen URLs to the crawl frontier
- Content fetching and storage -- The crawler downloads page HTML, extracts meaningful text content, and stores it for downstream indexing
- Deduplication -- The system avoids fetching the same URL twice within a crawl cycle and detects content-level duplicates across different URLs
- Incremental re-crawling -- Pages are re-crawled periodically based on a priority schedule, with frequently changing pages visited more often
Non-Functional
- Scalability -- Crawl all 60M+ Wikipedia pages within 24 hours using a distributed fleet of workers
- Politeness -- Respect robots.txt directives and enforce per-domain rate limits to avoid overloading Wikipedia servers
- Fault tolerance -- Individual worker failures do not cause data loss or stalled progress; the system recovers automatically
- Efficiency -- Minimize redundant fetches by tracking content hashes and using conditional HTTP requests (If-Modified-Since)
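The conditional-request idea can be sketched as a helper that builds headers from metadata saved during the previous crawl of a URL (the field names here are assumptions, not a fixed schema):

```python
def conditional_headers(cached):
    """Build HTTP headers for a conditional re-fetch.

    `cached` holds metadata stored after the previous crawl of this URL:
    the Last-Modified and ETag values the server returned, if any.
    A 304 Not Modified response then lets the crawler skip downloading
    and re-processing an unchanged page.
    """
    headers = {}
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    return headers
```

On a first-time fetch the cache entry is empty and no conditional headers are sent; thereafter each re-crawl piggybacks on whatever validators the server provided.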
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. URL Frontier and Worker Coordination
The crawl frontier -- the queue of URLs waiting to be fetched -- is the heart of the system. Interviewers want to see how you partition it across workers, enforce politeness constraints, and prevent duplicate work.
Hints to consider:
- Think about partitioning the frontier by domain (or hostname hash) so that all URLs for a given domain are routed to the same queue, making rate limiting straightforward
- Consider separating the frontier into a front queue (prioritized) and back queues (per-domain, rate-limited) for a two-level scheduling approach
- Explore how to handle the case where a worker pulls a URL but crashes before marking it as processed
- Think about using lease-based assignment with TTLs so that unacknowledged URLs are automatically re-enqueued
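The lease-based assignment idea above can be sketched as a small in-memory frontier; the TTL value and method names are illustrative, and a production version would live in a shared store such as Redis or a database:

```python
import time

class LeasedFrontier:
    """URL frontier with lease-based assignment: a pulled URL is removed
    only after an explicit ack; expired leases are re-enqueued, so a
    worker that crashes mid-fetch cannot strand a URL."""

    def __init__(self, lease_ttl=30.0):
        self.queue = []            # URLs waiting to be fetched
        self.leases = {}           # url -> lease expiry timestamp
        self.lease_ttl = lease_ttl

    def push(self, url):
        if url not in self.leases and url not in self.queue:
            self.queue.append(url)

    def pull(self, now=None):
        now = time.time() if now is None else now
        self._reclaim_expired(now)
        if not self.queue:
            return None
        url = self.queue.pop(0)
        self.leases[url] = now + self.lease_ttl
        return url

    def ack(self, url):
        # Worker confirms successful processing; URL leaves this cycle.
        self.leases.pop(url, None)

    def _reclaim_expired(self, now):
        expired = [u for u, exp in self.leases.items() if exp <= now]
        for u in expired:
            del self.leases[u]
            self.queue.append(u)
```

The key invariant is that every URL is either queued, leased, or acknowledged; a crash simply lets the lease expire and the URL flows back into the queue.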
2. Deduplication at Scale
With 60M+ pages and billions of extracted links, the crawler must efficiently determine whether a URL has already been seen or a page's content has already been stored.
Hints to consider:
- Consider using a Bloom filter for fast probabilistic URL deduplication with controlled false-positive rates
- Think about content-level deduplication using MD5 or SHA-256 hashes of page content to detect pages served at different URLs
- Explore how to handle URL canonicalization (trailing slashes, query parameter ordering, fragments) to reduce spurious duplicates
- Consider the trade-off between in-memory data structures (fast but memory-bound) and persistent stores (slower but unlimited)
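A minimal Bloom filter for URL deduplication can be sketched as follows (sizes and hash count are illustrative; a real deployment would size the bit array for the expected URL volume and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Probabilistic seen-URL set: no false negatives, and a
    false-positive rate tunable via bit-array size and hash count."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

Because false positives are possible, a "might contain" hit can be confirmed against a persistent store before skipping a URL; a miss is always safe to crawl.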
3. Rate Limiting and Politeness
Crawling too aggressively can get your IP blocked and creates ethical concerns. Interviewers expect you to design proper throttling mechanisms.
Hints to consider:
- Think about implementing per-domain token buckets that regulate the maximum fetch rate for each hostname
- Consider parsing and caching robots.txt for each domain and respecting crawl-delay directives
- Explore how to distribute rate limiting state when multiple workers may target the same domain
- Think about adaptive rate limiting that backs off when receiving 429 or 503 responses
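The per-domain token bucket mentioned above can be sketched like this (rate and capacity values are illustrative; in a distributed crawler the bucket state would live in shared storage or behind a per-domain queue owner):

```python
class TokenBucket:
    """Per-domain token bucket: allows at most `rate` fetches per second
    on average, with bursts up to `capacity`."""

    def __init__(self, rate=1.0, capacity=5.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity,
        # then spend one token if available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # hostname -> TokenBucket

def may_fetch(hostname, now):
    """Gate a fetch on that hostname's bucket, creating one on demand."""
    return buckets.setdefault(hostname, TokenBucket()).allow(now)
```

Adaptive politeness can then be layered on top by shrinking `rate` whenever a domain returns 429 or 503 and slowly restoring it on success.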
4. Fault Tolerance and Progress Tracking
Workers are ephemeral and the network is unreliable. The system must make steady forward progress despite continuous failures.
Hints to consider:
- Think about checkpointing crawl state periodically so that a restart does not require re-crawling everything
- Consider using an acknowledgment protocol where workers confirm successful processing before URLs are removed from the frontier
- Explore how to handle poison URLs that consistently cause worker crashes (e.g., malformed HTML, infinite redirect loops)
- Think about dead-letter queues for URLs that fail repeatedly, with alerts for human review
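The retry-then-dead-letter pattern from the hints above can be sketched as a small tracker (the attempt threshold and names are assumptions):

```python
from collections import defaultdict

class RetryTracker:
    """Counts per-URL failures; after `max_attempts` failures the URL is
    moved to a dead-letter queue for human review instead of being
    retried, which quarantines poison URLs that crash workers."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.failures = defaultdict(int)
        self.dead_letter = []

    def record_failure(self, url):
        """Return True if the URL should be re-enqueued for another
        attempt, False if it has been dead-lettered."""
        self.failures[url] += 1
        if self.failures[url] >= self.max_attempts:
            self.dead_letter.append(url)
            return False
        return True
```

A worker consults the tracker before re-enqueueing a failed URL, so a page that reliably crashes the fetch pipeline is retried a bounded number of times and then parked with an alert.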