Design a Web Crawler
Problem Statement
Design a distributed web crawler that systematically downloads web pages across the internet, extracts links, and feeds them back into the crawl pipeline for further exploration. This is the backbone of search engine indexing, content analysis, and web archiving systems.
The primary challenges are managing the crawl frontier (the queue of URLs to visit), deduplicating URLs and content to avoid wasting resources on already-seen pages, and respecting robots.txt rules and per-domain rate limits to be a polite crawler. The system must handle billions of URLs, recover gracefully from failures, and run continuously as a long-lived pipeline.
Your design should balance crawl breadth (discovering new domains) with crawl depth (fully exploring important sites), while maintaining high throughput and politeness across millions of domains.
Key Requirements
Functional
- URL frontier management -- Maintain a prioritized queue of URLs to crawl, with support for breadth-first, depth-first, or priority-based traversal strategies.
- Content fetching and parsing -- Download web pages, extract text and outgoing links, and normalize discovered URLs before adding them to the frontier.
- Deduplication -- Detect and skip URLs already crawled and content already seen (near-duplicate detection).
- Politeness and compliance -- Respect robots.txt directives, enforce per-domain crawl delays, and honor rate limits.
Non-Functional
- Scalability -- Crawl 1,000+ pages per second across the distributed system.
- Fault tolerance -- No URL is permanently lost if a crawler node fails mid-fetch; the URL re-enters the frontier.
- Freshness -- Support recrawl scheduling so important pages are revisited periodically.
- Storage efficiency -- Store crawled content compactly with metadata for downstream consumers.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Crawl Frontier Design
The frontier is the heart of the crawler — its design determines crawl order, politeness, and scalability.
Hints to consider:
- Separate the frontier into a priority queue (which URL to crawl next) and a politeness queue (per-domain rate limiting)
- How do you prevent a single large domain from monopolizing crawler resources?
- Consider using Kafka with one partition per domain bucket so that each consumer handles a subset of domains
- What data structure lets you efficiently check "is this URL already in the frontier?" without loading the entire queue into memory?
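One common answer to the first hint is the Mercator-style split: priority "front" queues decide *which* URL matters, and per-domain "back" queues decide *when* it may be fetched. The sketch below models only the politeness side with in-memory structures (a heap of domains keyed by their next-allowed fetch time); the priority front queues and the distributed Kafka layout are omitted, and all names here are illustrative, not from a real library.

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Politeness half of a Mercator-style frontier: each domain has its
    own back queue, and a heap orders domains by next-allowed fetch time."""

    def __init__(self, delay=1.0):
        self.delay = delay              # min seconds between hits to one domain
        self.back = defaultdict(deque)  # domain -> URLs waiting to be fetched
        self.ready = []                 # heap of (next_fetch_time, domain)
        self.scheduled = set()          # domains currently in the heap

    def add(self, domain, url):
        self.back[domain].append(url)
        if domain not in self.scheduled:
            heapq.heappush(self.ready, (time.time(), domain))
            self.scheduled.add(domain)

    def next_url(self):
        """Return (domain, url) whose politeness delay has elapsed, else None."""
        if not self.ready or self.ready[0][0] > time.time():
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.back[domain].popleft()
        if self.back[domain]:           # reschedule the domain after its delay
            heapq.heappush(self.ready, (time.time() + self.delay, domain))
        else:
            self.scheduled.discard(domain)
        return domain, url
```

Because each domain appears at most once in the heap, a single huge domain can never hold more than one "ready" slot, which directly answers the monopolization question above.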
2. URL and Content Deduplication
Without deduplication, the crawler wastes resources revisiting the same content through different URLs.
Hints to consider:
- Use a Bloom filter in Redis to check URL membership with minimal memory — accept a small false-positive rate
- How do you handle URL normalization (trailing slashes, query parameter ordering, fragment removal) before dedup checks?
- For content deduplication, consider SimHash or MinHash to detect near-duplicate pages
- What is the memory cost of a Bloom filter for 10 billion URLs at a 1% false-positive rate?
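The Bloom filter sizing question has a closed-form answer: the optimal filter uses m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below computes that for 10 billion URLs at a 1% false-positive rate, alongside a minimal URL normalization helper (lowercase scheme and host, sorted query parameters, dropped fragment, trimmed trailing slash) of the kind you would run before any dedup check; the exact normalization rules are an illustrative choice, not a standard.

```python
import math
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def bloom_size(n, p):
    """Optimal Bloom filter size in bits and hash count for
    n elements at false-positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)   # bits
    k = (m / n) * math.log(2)                   # hash functions
    return m, k

def normalize_url(url):
    """Canonicalize a URL before dedup: lowercase scheme and host,
    sort query parameters, drop the fragment, trim trailing slash."""
    scheme, netloc, path, query, _ = urlsplit(url)
    query = urlencode(sorted(parse_qsl(query)))
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))

bits, hashes = bloom_size(10_000_000_000, 0.01)
print(f"{bits / 8 / 2**30:.1f} GiB, {math.ceil(hashes)} hash functions")
# ≈ 11.2 GiB with 7 hash functions
```

About 11 GiB for 10 billion URLs is small enough to fit on a single Redis node, which is why the Bloom filter approach is viable at this scale.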
3. Politeness and Rate Limiting
A well-behaved crawler must respect site owners' wishes and avoid overwhelming any single domain.
Hints to consider:
- Fetch and cache robots.txt for each domain with a TTL of 24 hours; check it before every crawl
- Enforce a minimum delay between consecutive requests to the same domain (e.g., 1 second)
- Consider a token bucket per domain stored in Redis — each fetch consumes a token
- How do you handle domains that return 429 Too Many Requests or temporary errors?
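The token-bucket hint can be sketched in a few lines. This version keeps the bucket state in a plain Python object for clarity; a real deployment would hold it in Redis (typically behind an atomic Lua script) so every worker shares the same per-domain budget. The rate and capacity values are illustrative.

```python
import time

class TokenBucket:
    """Per-domain token bucket: holds up to `capacity` tokens, refilled
    at `rate` tokens/sec; each fetch attempt consumes one token.
    In production this state would live in Redis, shared by all workers."""

    def __init__(self, rate=1.0, capacity=2.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def try_consume(self):
        now = time.monotonic()
        # refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False    # caller should requeue the URL, not drop it
```

A natural extension for the 429 question: on receiving 429, shrink the domain's rate (or pause the bucket entirely) and push the URL back into the frontier with an exponential-backoff delay.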
4. Fault Tolerance and Recovery
Crawlers run for days or weeks; failures are inevitable and the system must recover without losing progress.
Hints to consider:
- Use at-least-once delivery: a URL is only removed from the frontier after its content is successfully stored
- How do you checkpoint the frontier state so that a crashed crawler node can be replaced without recrawling everything?
- Consider a lease-based model where a URL is leased to a worker for a timeout period and returns to the queue if not acknowledged
- What monitoring signals tell you the crawler is healthy — pages per second, error rate, frontier growth rate?
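The lease-based model above can be sketched as a small queue abstraction: a worker leases a URL for a timeout window and must acknowledge it only after the content is durably stored; unacknowledged leases expire back into the queue, giving at-least-once delivery. This in-memory version is illustrative only; a real frontier would persist this state (e.g. in Kafka offsets or a database).

```python
import heapq
import time

class LeaseQueue:
    """At-least-once URL queue: lease() hands a URL to a worker for
    lease_sec; if ack() never arrives, the URL becomes leasable again."""

    def __init__(self, lease_sec=30.0):
        self.lease_sec = lease_sec
        self.pending = []    # heap of (available_at, url)
        self.leased = {}     # url -> lease expiry time

    def put(self, url, at=0.0):
        heapq.heappush(self.pending, (at, url))

    def lease(self):
        now = time.time()
        # return expired leases to the queue (crashed or slow workers)
        for url, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[url]
                heapq.heappush(self.pending, (now, url))
        if self.pending and self.pending[0][0] <= now:
            _, url = heapq.heappop(self.pending)
            self.leased[url] = now + self.lease_sec
            return url
        return None

    def ack(self, url):
        """Call only after the page content is successfully stored."""
        self.leased.pop(url, None)
```

Note that a worker crashing mid-fetch costs nothing but the lease timeout: the URL simply reappears, which is exactly the no-URL-lost guarantee the non-functional requirements ask for. Duplicate fetches after a late ack are possible, so downstream storage should be idempotent.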