Design Wikipedia Crawler
System Design · Must
Problem Statement
Design a distributed web crawler that can systematically discover, fetch, and store all pages on Wikipedia at scale. The system must handle tens of millions of pages, respect crawl rate limits to avoid overloading Wikipedia's servers, detect and process page updates efficiently, and store the crawled content in a searchable format for downstream consumers such as search indexes or analytics pipelines.
This problem tests your understanding of distributed systems coordination, politeness policies, URL frontier management, and fault tolerance in long-running batch processes. At Lyft, interviewers use this question to evaluate how you decompose a large-scale data ingestion problem into parallelizable components, handle deduplication of URLs, and design for graceful recovery when individual workers fail mid-crawl.
The key tension is between crawl speed and politeness. Wikipedia has millions of pages that change at different rates, and your system needs to prioritize fresh content while respecting robots.txt directives and rate limits. You also need to handle edge cases like redirect chains, duplicate content across URLs, and pages that change frequently versus those that are essentially static.
Key Requirements
Functional
- URL discovery and frontier management -- Maintain a prioritized queue of URLs to crawl, seeded from Wikipedia's sitemap and expanded by extracting links from fetched pages
- Content fetching and parsing -- Download HTML pages, extract text content and metadata, and identify outgoing links for further crawling
- Deduplication -- Detect and skip duplicate URLs (including canonicalization of different URL formats pointing to the same page) and duplicate content
- Incremental re-crawling -- Track when pages were last crawled and prioritize re-fetching pages that are likely to have changed based on historical update frequency
Non-Functional
- Scalability -- Crawl 50 million pages within 24 hours using a horizontally scalable worker fleet
- Politeness -- Enforce per-domain rate limits (no more than 1 request per second to any single host) and respect robots.txt directives
- Reliability -- Tolerate worker failures without losing progress; resume interrupted crawls from the last checkpoint
- Storage efficiency -- Compress and deduplicate stored content to minimize storage costs while maintaining fast retrieval
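The per-host politeness requirement above is simple enough to sketch directly. The following is a minimal illustration (class and parameter names are my own, not from the problem statement) that enforces a minimum spacing between requests to the same host; the 1.0-second default mirrors the stated limit of one request per second per host.

```python
import time
from collections import defaultdict

class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self.last_request = defaultdict(float)  # host -> timestamp of last request

    def wait_time(self, host: str, now: float) -> float:
        """Seconds a worker must still wait before it may hit `host`."""
        elapsed = now - self.last_request[host]
        return max(0.0, self.min_interval_s - elapsed)

    def acquire(self, host: str) -> None:
        """Block until a request to `host` is allowed, then record it."""
        delay = self.wait_time(host, time.monotonic())
        if delay > 0:
            time.sleep(delay)
        self.last_request[host] = time.monotonic()
```

In a real deployment this state would live wherever the per-host queues live (see the frontier discussion below), so that one worker's requests are visible to the limiter for all of that host's traffic.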
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. URL Frontier Design and Prioritization
The URL frontier is the heart of any web crawler. Interviewers want to see how you balance breadth-first discovery with priority-based scheduling, and how you prevent the frontier from growing without bound or becoming a bottleneck.
Hints to consider:
- Use a two-level queue: a priority queue that selects which URLs to crawl next based on importance, and per-host queues that enforce politeness delays
- Implement priority scoring based on page rank, last-modified headers, historical change frequency, and depth from seed URLs
- Shard the frontier across multiple nodes by hashing the domain name to ensure per-host politeness is enforced locally
- Use a Bloom filter or similar probabilistic data structure to cheaply check whether a URL has already been enqueued
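The two-level structure from the hints above can be sketched in a few dozen lines. This is an in-memory toy, not a distributed frontier: the front level is a heap of hosts ordered by when they next become eligible (which enforces the politeness delay), the back level is one FIFO queue per host, and a plain `set` stands in for the Bloom filter.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

class TwoLevelFrontier:
    """Front: heap of (next_allowed_time, -priority, host).
    Back: per-host FIFO queues. `seen` stands in for a Bloom filter."""

    def __init__(self, politeness_delay_s: float = 1.0):
        self.delay = politeness_delay_s
        self.host_queues = defaultdict(deque)  # host -> (priority, url) entries
        self.ready_heap = []                   # (next_allowed_time, -priority, host)
        self.scheduled_hosts = set()
        self.seen = set()                      # stand-in for a Bloom filter

    def enqueue(self, url: str, priority: float = 0.0) -> bool:
        if url in self.seen:                   # cheap already-enqueued check
            return False
        self.seen.add(url)
        host = urlparse(url).netloc
        self.host_queues[host].append((priority, url))
        if host not in self.scheduled_hosts:
            heapq.heappush(self.ready_heap, (0.0, -priority, host))
            self.scheduled_hosts.add(host)
        return True

    def next_url(self, now: float):
        """Return a URL whose host is past its politeness delay, or None."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, _, host = heapq.heappop(self.ready_heap)
        _, url = self.host_queues[host].popleft()
        if self.host_queues[host]:             # reschedule host after the delay
            next_prio = self.host_queues[host][0][0]
            heapq.heappush(self.ready_heap, (now + self.delay, -next_prio, host))
        else:
            self.scheduled_hosts.remove(host)
        return url
```

A production scoring function would fold in page rank, `Last-Modified` headers, and historical change frequency as the hints describe; here priority is just a number passed by the caller.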
2. Distributed Coordination and Partition Strategy
With hundreds of workers crawling concurrently, you need a strategy to divide work without conflicts, minimize coordination overhead, and ensure no pages are missed or crawled redundantly.
Hints to consider:
- Partition URLs by domain hash so each worker is responsible for a set of domains, which naturally enforces per-domain rate limits
- Use a central coordinator (or consistent hashing ring) to assign domain partitions to workers and rebalance when workers join or leave
- Workers pull batches of URLs from their assigned partition rather than individual URLs to reduce coordination frequency
- Implement heartbeat-based failure detection so abandoned partitions are reassigned within seconds
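The consistent-hashing option from the hints above can be sketched as follows. This is an illustrative toy (worker names and the virtual-node count are made up): each worker owns many points on a hash ring, a domain is assigned to the first worker point clockwise from its hash, and removing a worker only reassigns the domains that worker owned.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map domains to workers; virtual nodes smooth out the load."""

    def __init__(self, workers=(), vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker)
        for w in workers:
            self.add_worker(w)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def add_worker(self, worker: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{worker}#{i}"), worker))

    def remove_worker(self, worker: str) -> None:
        self.ring = [(h, w) for h, w in self.ring if w != worker]

    def owner(self, domain: str) -> str:
        """Worker responsible for all URLs under `domain`."""
        h = self._hash(domain)
        idx = bisect.bisect(self.ring, (h, ""))
        if idx == len(self.ring):  # wrap around the ring
            idx = 0
        return self.ring[idx][1]
```

The useful property for the interview discussion: when a worker leaves (or a new one joins), only the partitions adjacent to its points on the ring move, so rebalancing does not reshuffle the whole frontier.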
3. Deduplication at Scale
Wikipedia has many URLs that resolve to the same content (redirects, language variants, URL parameters). Interviewers expect you to handle both URL-level and content-level deduplication efficiently.
Hints to consider:
- Canonicalize URLs by normalizing scheme, removing trailing slashes, sorting query parameters, and following redirects to their final destination
- Use content fingerprinting (SimHash or MinHash) to detect near-duplicate pages that differ only in boilerplate or timestamps
- Store URL-to-content-hash mappings in a distributed key-value store like Redis or DynamoDB for fast lookup during crawl
- Maintain a Bloom filter with billions of entries for fast negative lookups on previously seen URLs
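The canonicalization step in the first hint is concrete enough to sketch. A minimal version, assuming we lowercase scheme and host, drop default ports and fragments, strip the trailing slash, and sort query parameters; redirect resolution needs a network round trip and is omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Normalize a URL so different spellings of the same page collide."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop default ports (:80 for http, :443 for https)
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = path.rstrip("/") or "/"                    # no trailing slash
    query = urlencode(sorted(parse_qsl(query)))       # stable parameter order
    return urlunsplit((scheme, netloc, path, query, ""))
```

The canonical form, not the raw URL, is what should be hashed into the Bloom filter and used as the key in the URL-to-content-hash store, so that spelling variants of the same page are caught before a fetch is wasted.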
4. Fault Tolerance and Checkpointing
A full crawl of Wikipedia takes hours or days. Workers will crash, network requests will fail, and the system must continue making progress without re-crawling everything from scratch.
Hints to consider:
- Persist the frontier state periodically to durable storage so a crashed coordinator can resume from the last checkpoint
- Use visibility timeouts on frontier entries: if a worker claims a URL batch but fails to report completion, the batch becomes available again
- Store crawl results incrementally as each page is fetched rather than batching writes, so completed work survives worker crashes
- Implement exponential backoff with jitter for transient HTTP errors (429, 503) and permanent failure detection for consistently failing URLs
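The visibility-timeout idea above is the crux of crash recovery, and a small in-memory sketch makes the mechanics concrete (in production this role is typically played by a queue service or the frontier store itself; the class and timeout value here are illustrative). Claiming a batch hides it for a lease period; if the worker never acknowledges completion, the batch silently becomes claimable again.

```python
class LeaseQueue:
    """Frontier batches with visibility timeouts for crash recovery."""

    def __init__(self, visibility_s: float = 30.0):
        self.visibility_s = visibility_s
        self.available = []   # list of (batch_id, urls)
        self.in_flight = {}   # batch_id -> (lease deadline, urls)

    def put(self, batch_id: str, urls: list) -> None:
        self.available.append((batch_id, urls))

    def claim(self, now: float):
        """Hand out one batch, reviving any expired leases first."""
        for bid, (deadline, urls) in list(self.in_flight.items()):
            if deadline <= now:              # worker died or stalled: revive
                del self.in_flight[bid]
                self.available.append((bid, urls))
        if not self.available:
            return None
        bid, urls = self.available.pop(0)
        self.in_flight[bid] = (now + self.visibility_s, urls)
        return bid, urls

    def ack(self, batch_id: str) -> None:
        """Worker reports completion; the batch is retired for good."""
        self.in_flight.pop(batch_id, None)
```

Because revived batches may be re-fetched by a second worker, downstream writes must be idempotent (keyed by canonical URL), which pairs naturally with the incremental-write hint above.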