Design Wikipedia Crawler
Problem Statement
You are designing a distributed web crawler that systematically downloads and indexes all pages on Wikipedia. The crawler must traverse the link graph starting from a set of seed URLs, fetch page content, extract outgoing links, and continue the process until all reachable pages have been visited. Wikipedia contains over 60 million articles across 300+ language editions, with pages being added and modified continuously, so the crawler must also support periodic re-crawling to keep its index fresh.
The core challenges revolve around coordination and efficiency. Multiple crawler workers must operate in parallel without duplicating effort -- two workers should never fetch the same URL simultaneously or redundantly. The crawler must respect rate limits to avoid overwhelming Wikipedia's servers, handle transient failures gracefully (timeouts, 5xx errors, network partitions), and detect when a page has not changed since the last crawl to skip unnecessary processing. The system should prioritize frequently updated pages and high-traffic articles over rarely changed backwaters.
Scaling the crawler across multiple machines introduces further complexity: distributing URLs to workers, maintaining a shared frontier of URLs to visit, tracking which pages have already been crawled, and ensuring the system makes forward progress even when individual workers fail.
Key Requirements
Functional
- URL discovery and traversal -- Starting from seed URLs, the crawler extracts links from fetched pages and adds unseen URLs to the crawl frontier
- Content fetching and storage -- The crawler downloads page HTML, extracts meaningful text content, and stores it for downstream indexing
- Deduplication -- The system avoids fetching the same URL twice within a crawl cycle and detects content-level duplicates across different URLs
- Incremental re-crawling -- Pages are re-crawled periodically based on a priority schedule, with frequently changing pages visited more often
Non-Functional
- Scalability -- Crawl all 60M+ Wikipedia pages within 24 hours using a distributed fleet of workers
- Politeness -- Respect robots.txt directives and enforce per-domain rate limits to avoid overloading Wikipedia servers
- Fault tolerance -- Individual worker failures do not cause data loss or stalled progress; the system recovers automatically
- Efficiency -- Minimize redundant fetches by tracking content hashes and using conditional HTTP requests (If-Modified-Since)
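The conditional-request idea can be sketched as a helper that builds headers from metadata saved during the previous crawl of a URL (the field names here are assumptions, not a fixed schema):

```python
def conditional_headers(cached):
    """Build HTTP headers for a conditional re-fetch.

    `cached` holds metadata stored after the previous crawl of this URL:
    the Last-Modified and ETag values the server returned, if any.
    A 304 Not Modified response then lets the crawler skip downloading
    and re-processing an unchanged page.
    """
    headers = {}
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    return headers
```

On a first-time fetch the cache entry is empty and no conditional headers are sent; thereafter each re-crawl piggybacks on whatever validators the server provided.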
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. URL Frontier and Worker Coordination
The crawl frontier -- the queue of URLs waiting to be fetched -- is the heart of the system. Interviewers want to see how you partition it across workers, enforce politeness constraints, and prevent duplicate work.
Hints to consider:
- Think about partitioning the frontier by domain (or hostname hash) so that all URLs for a given domain are routed to the same queue, making rate limiting straightforward
- Consider separating the frontier into a front queue (prioritized) and back queues (per-domain, rate-limited) for a two-level scheduling approach
- Explore how to handle the case where a worker pulls a URL but crashes before marking it as processed
- Think about using lease-based assignment with TTLs so that unacknowledged URLs are automatically re-enqueued
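The lease-based assignment idea above can be sketched as a small in-memory frontier; the TTL value and method names are illustrative, and a production version would live in a shared store such as Redis or a database:

```python
import time

class LeasedFrontier:
    """URL frontier with lease-based assignment: a pulled URL is removed
    only after an explicit ack; expired leases are re-enqueued, so a
    worker that crashes mid-fetch cannot strand a URL."""

    def __init__(self, lease_ttl=30.0):
        self.queue = []            # URLs waiting to be fetched
        self.leases = {}           # url -> lease expiry timestamp
        self.lease_ttl = lease_ttl

    def push(self, url):
        if url not in self.leases and url not in self.queue:
            self.queue.append(url)

    def pull(self, now=None):
        now = time.time() if now is None else now
        self._reclaim_expired(now)
        if not self.queue:
            return None
        url = self.queue.pop(0)
        self.leases[url] = now + self.lease_ttl
        return url

    def ack(self, url):
        # Worker confirms successful processing; URL leaves this cycle.
        self.leases.pop(url, None)

    def _reclaim_expired(self, now):
        expired = [u for u, exp in self.leases.items() if exp <= now]
        for u in expired:
            del self.leases[u]
            self.queue.append(u)
```

The key invariant is that every URL is either queued, leased, or acknowledged; a crash simply lets the lease expire and the URL flows back into the queue.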
2. Deduplication at Scale
With 60M+ pages and billions of extracted links, the crawler must efficiently determine whether a URL has already been seen or a page's content has already been stored.
Hints to consider:
- Consider using a Bloom filter for fast probabilistic URL deduplication with controlled false-positive rates
- Think about content-level deduplication using MD5 or SHA-256 hashes of page content to detect pages served at different URLs
- Explore how to handle URL canonicalization (trailing slashes, query parameter ordering, fragments) to reduce spurious duplicates
- Consider the trade-off between in-memory data structures (fast but memory-bound) and persistent stores (slower but unlimited)
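A minimal Bloom filter for URL deduplication can be sketched as follows (sizes and hash count are illustrative; a real deployment would size the bit array for the expected URL volume and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Probabilistic seen-URL set: no false negatives, and a
    false-positive rate tunable via bit-array size and hash count."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

Because false positives are possible, a "might contain" hit can be confirmed against a persistent store before skipping a URL; a miss is always safe to crawl.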
3. Rate Limiting and Politeness
Crawling too aggressively can get your IP blocked and creates ethical concerns. Interviewers expect you to design proper throttling mechanisms.
Hints to consider:
- Think about implementing per-domain token buckets that regulate the maximum fetch rate for each hostname
- Consider parsing and caching robots.txt for each domain and respecting crawl-delay directives
- Explore how to distribute rate limiting state when multiple workers may target the same domain
- Think about adaptive rate limiting that backs off when receiving 429 or 503 responses
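The per-domain token bucket mentioned above can be sketched like this (rate and capacity values are illustrative; in a distributed crawler the bucket state would live in shared storage or behind a per-domain queue owner):

```python
class TokenBucket:
    """Per-domain token bucket: allows at most `rate` fetches per second
    on average, with bursts up to `capacity`."""

    def __init__(self, rate=1.0, capacity=5.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity,
        # then spend one token if available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # hostname -> TokenBucket

def may_fetch(hostname, now):
    """Gate a fetch on that hostname's bucket, creating one on demand."""
    return buckets.setdefault(hostname, TokenBucket()).allow(now)
```

Adaptive politeness can then be layered on top by shrinking `rate` whenever a domain returns 429 or 503 and slowly restoring it on success.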
4. Fault Tolerance and Progress Tracking
Workers are ephemeral and the network is unreliable. The system must make steady forward progress despite continuous failures.
Hints to consider:
- Think about checkpointing crawl state periodically so that a restart does not require re-crawling everything
- Consider using an acknowledgment protocol where workers confirm successful processing before URLs are removed from the frontier
- Explore how to handle poison URLs that consistently cause worker crashes (e.g., malformed HTML, infinite redirect loops)
- Think about dead-letter queues for URLs that fail repeatedly, with alerts for human review
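The retry-then-dead-letter pattern from the hints above can be sketched as a small tracker (the attempt threshold and names are assumptions):

```python
from collections import defaultdict

class RetryTracker:
    """Counts per-URL failures; after `max_attempts` failures the URL is
    moved to a dead-letter queue for human review instead of being
    retried, which quarantines poison URLs that crash workers."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.failures = defaultdict(int)
        self.dead_letter = []

    def record_failure(self, url):
        """Return True if the URL should be re-enqueued for another
        attempt, False if it has been dead-lettered."""
        self.failures[url] += 1
        if self.failures[url] >= self.max_attempts:
            self.dead_letter.append(url)
            return False
        return True
```

A worker consults the tracker before re-enqueueing a failed URL, so a page that reliably crashes the fetch pipeline is retried a bounded number of times and then parked with an alert.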