Design a Web Crawler with Limited Communication
System Design | Must
Problem Statement
Design a distributed web crawler that uses 10,000 machines to crawl 1 billion URLs without relying on any centralized components -- no master nodes, centralized databases, managed queues, or cloud services. Each machine has substantial CPU and local storage but communication between nodes is constrained and unreliable. The system must discover, fetch, parse, and store web pages while respecting robots.txt policies and per-host rate limits across the entire fleet.
This variant of the web crawler problem is particularly relevant to Bloomberg's infrastructure culture, where large-scale data collection pipelines operate across distributed clusters. The core challenge is decentralized coordination: how do 10,000 independent machines agree on which URLs to crawl, avoid duplicating work, enforce politeness constraints consistently, and recover from failures without a single coordinator? You need deterministic work assignment, efficient deduplication at scale, and graceful degradation when nodes fail or network partitions occur.
Key Requirements
Functional
- Seed-based crawl initiation -- start a crawl from one or more seed URLs with configurable scope (domains, depth limits, file type filters) and propagate discovered links across the fleet
- Politeness enforcement -- comply with robots.txt directives and enforce per-host rate limits (e.g., one request per second per domain) across all 10,000 machines without centralized coordination
- URL deduplication -- ensure each URL is crawled at most once per crawl cycle, even when the same link is discovered by multiple machines simultaneously
- Progress monitoring and recrawl -- track crawl progress, support pause and resume, and schedule recrawls with configurable freshness policies for previously visited pages
Non-Functional
- Scalability -- utilize all 10,000 machines effectively to crawl 1 billion URLs within days, with linear throughput scaling as machines are added
- Reliability -- tolerate individual node failures (up to 5 percent of the fleet) and network partitions without losing significant crawl progress or duplicating substantial work
- Efficiency -- minimize inter-node communication bandwidth by batching transfers and using compact data structures for deduplication state
- Consistency -- guarantee that politeness constraints are never violated, even during transient failures; occasional duplicate crawls are the accepted trade-off
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Decentralized Work Assignment and URL Partitioning
Without a master node, the fleet needs a deterministic scheme to decide which machine is responsible for which URLs. Interviewers want to see how you partition the URL space, handle rebalancing when machines join or leave, and avoid hotspots on popular domains.
Hints to consider:
- Use consistent hashing on the domain or hostname to assign each domain to a specific machine, ensuring all URLs from the same host are crawled by one node (which simplifies rate limiting)
- Maintain a hash ring where each physical machine owns multiple virtual nodes to improve load distribution and reduce hotspot probability
- When a machine fails, its neighbors on the ring absorb its partition; when it recovers, it reclaims ownership by replaying its local checkpoint
- For extremely popular domains (millions of pages), allow the owning machine to sub-delegate page-level crawling to peers while retaining rate-limit authority
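The routing scheme in these hints can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `HashRing` class, its virtual-node count, and the use of SHA-1 are all assumptions made for the example.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes. Every machine builds an
    identical ring from the same membership list, so URL ownership is
    decided deterministically with no communication."""

    def __init__(self, machines, vnodes=100):
        # Each physical machine owns `vnodes` points on the ring to
        # smooth out load distribution.
        self._ring = sorted(
            (self._hash(f"{m}#{v}"), m)
            for m in machines
            for v in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def owner(self, hostname):
        """All URLs on the same host map to one machine, which then also
        holds rate-limit authority for that host."""
        h = self._hash(hostname)
        # First ring point at or after h, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Because the ring is a pure function of the membership list, any node can compute `owner(...)` locally; when a machine fails, removing it from the list automatically reassigns its points to the next machines clockwise.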
2. Deduplication Without a Central Database
Deduplicating 1 billion URLs across 10,000 machines without a shared database requires careful design. Interviewers assess whether you can manage distributed state efficiently with bounded memory.
Hints to consider:
- Each machine maintains a Bloom filter for URLs it has seen, providing space-efficient approximate membership testing with configurable false positive rates
- When Machine A discovers a URL that hashes to Machine B's partition, it accumulates the URL in a local per-destination buffer and transfers buffered batches to Machine B periodically, rather than sending URLs one at a time
- Machine B checks incoming URLs against its local Bloom filter and on-disk URL store before adding them to its crawl frontier
- Periodically compact the on-disk URL store using LSM-tree style merging to reclaim space from completed or expired entries
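A Bloom filter of the kind described above is straightforward to sketch. The sizes below are illustrative placeholders; a real deployment would size the filter from the expected URL count and target false-positive rate. The double-hashing trick (deriving k indices from two base hashes) is a standard construction, shown here with SHA-256.

```python
import hashlib

class BloomFilter:
    """Space-efficient approximate membership test. A false positive
    means a never-crawled URL may be skipped; false negatives never
    occur, so a crawled URL is never re-enqueued."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from two base hashes.
        d = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))
```

Since the filter is approximate, it serves only as a fast first check; a hit still falls through to the authoritative on-disk URL store before a URL is discarded as a duplicate.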
3. Politeness and Rate Limiting in a Distributed Fleet
Violating robots.txt or flooding a host with requests can get the entire fleet banned. Interviewers expect you to show how per-host rate limits are enforced consistently without centralized coordination.
Hints to consider:
- Since consistent hashing assigns all URLs for a given host to one machine, that machine alone is responsible for rate-limiting requests to that host -- no distributed locking needed
- Maintain per-host queues with token bucket rate limiters on each machine, spacing requests according to robots.txt crawl-delay directives
- Cache robots.txt files locally with TTL-based refresh (e.g., re-fetch every 24 hours) and respect all Disallow rules before adding URLs to the crawl frontier
- Implement backoff logic that increases delay when servers respond with 429 (Too Many Requests) or 503 (Service Unavailable) status codes
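These two hints (per-host spacing and 429/503 backoff) compose naturally into one small object. The sketch below is a simplified single-token variant of a token bucket, with hypothetical names; real crawlers would also persist state across restarts and honor `Retry-After` headers.

```python
import time

class HostRateLimiter:
    """Spaces requests to one host by crawl_delay seconds. A backoff
    multiplier doubles on 429/503 responses and resets on success."""

    def __init__(self, crawl_delay=1.0, max_backoff=300.0):
        self.base_delay = crawl_delay
        self.max_backoff = max_backoff
        self.backoff = 1.0
        self.next_allowed = 0.0

    def acquire(self, now=None):
        """Return seconds to wait before the next request (0.0 if ready).
        When ready, the next slot is reserved immediately."""
        now = time.monotonic() if now is None else now
        wait = max(0.0, self.next_allowed - now)
        if wait == 0.0:
            self.next_allowed = now + self.base_delay * self.backoff
        return wait

    def on_response(self, status):
        if status in (429, 503):
            # Server is overloaded or throttling us: double the spacing,
            # capped so delays never exceed max_backoff seconds.
            self.backoff = min(self.backoff * 2,
                               self.max_backoff / self.base_delay)
        else:
            self.backoff = 1.0
```

Because consistent hashing gives each host exactly one owner, one such limiter per host on the owning machine is sufficient; no cross-machine locking is involved.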
4. Fault Tolerance and Checkpoint Recovery
With 10,000 machines running for days, failures are guaranteed. Interviewers probe how you handle node crashes and network partitions while ensuring forward progress without losing completed work.
Hints to consider:
- Each machine periodically checkpoints its crawl state (frontier queue position, Bloom filter, completed URL count) to local disk in an append-only log format
- On restart, a machine replays its checkpoint to restore state and resumes crawling from where it left off, with idempotent fetch operations preventing duplicate page storage
- Implement heartbeat-based failure detection using gossip protocol among neighboring machines on the hash ring; when a machine is declared dead, its neighbors temporarily absorb its URL partition
- Use anti-entropy synchronization where machines periodically exchange Bloom filter digests with neighbors to detect and reconcile missed URLs from partition events
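The append-only checkpoint log from the first two hints can be sketched as follows. This is a deliberately simple illustration (JSON lines, last-writer-wins replay); the `CheckpointLog` class and its record schema are assumptions for the example, and a real system would also rotate and compact the log.

```python
import json
import os

class CheckpointLog:
    """Append-only checkpoint log on local disk. On restart, the machine
    replays every record to rebuild its frontier position and counters."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # One JSON record per line; fsync so the entry survives a crash.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        """Fold all records into the latest state; later fields win.
        A torn final line from a mid-write crash is silently dropped."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    state.update(json.loads(line))
                except json.JSONDecodeError:
                    break
        return state
```

Replay restores only this machine's own partition state; reconciling URLs missed during a partition is handled separately by the anti-entropy digest exchange with ring neighbors.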
Suggested Approach
Step 1: Clarify Requirements
Start by confirming the constraints and priorities. Ask how strictly the "no centralized components" rule is enforced -- is a lightweight coordination service like ZooKeeper acceptable, or must everything be purely peer-to-peer? Clarify the network bandwidth budget between machines and whether there is a shared filesystem or only local storage. Determine how deep the crawl should go (number of hops from seed URLs), whether you need to store full page content or just URLs and metadata, and what the target crawl completion time is. Confirm whether the system needs to handle dynamic content (JavaScript rendering) or only static HTML.
Step 2: High-Level Architecture
Sketch the components on each of the 10,000 machines:
URL Router: Uses consistent hashing on the hostname to determine which machine owns each discovered URL. Discovered URLs are buffered locally and shipped in batches to the owning machine. Each machine runs the same routing logic using a shared, gossip-replicated hash ring configuration.
Crawl Frontier: A priority queue of URLs to fetch, organized by host with per-host sub-queues. The frontier enforces rate limits using token buckets and prioritizes URLs by discovery depth, estimated page importance, and freshness requirements.
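A minimal sketch of such a frontier, assuming a single numeric priority per URL and a uniform crawl delay (both simplifications; the class and method names are illustrative):

```python
import heapq

class CrawlFrontier:
    """Priority queue of per-host sub-queues. Hosts are ordered by the
    earliest time they may be fetched again, so politeness and priority
    are enforced together."""

    def __init__(self):
        self.host_queues = {}   # host -> heap of (priority, url)
        self.ready_heap = []    # heap of (next_fetch_time, host)

    def add(self, host, url, priority=0, now=0.0):
        if host not in self.host_queues:
            self.host_queues[host] = []
            heapq.heappush(self.ready_heap, (now, host))
        heapq.heappush(self.host_queues[host], (priority, url))

    def next_url(self, now, crawl_delay=1.0):
        """Return the best URL from a host whose delay has elapsed,
        or None if no host is ready yet."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_heap)
        _, url = heapq.heappop(self.host_queues[host])
        if self.host_queues[host]:
            # Reschedule the host after its crawl delay.
            heapq.heappush(self.ready_heap, (now + crawl_delay, host))
        else:
            del self.host_queues[host]
        return url
```

A fuller version would take the per-host delay from the `HostRateLimiter` and fold depth, importance, and freshness into the priority score.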
Fetcher: HTTP client that downloads pages, respects robots.txt directives, handles redirects, retries transient errors with exponential backoff, and writes raw page content to local disk storage.
Parser and Link Extractor: Parses downloaded HTML to extract outgoing links, normalizes URLs (resolve relative paths, remove fragments, canonicalize query parameters), and routes newly discovered URLs through the URL Router for deduplication and assignment.
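The normalization step described here maps directly onto Python's standard `urllib.parse` module; a minimal sketch (sorting query parameters is one canonicalization policy among several, chosen here for illustration):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(base, href):
    """Resolve a discovered link against its page URL, drop the fragment,
    lowercase the scheme and host, and sort query parameters so that
    equivalent URLs hash to the same dedup key."""
    absolute = urljoin(base, href)          # resolve relative paths
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    canonical_query = urlencode(sorted(parse_qsl(query)))
    return urlunsplit(
        (scheme.lower(), netloc.lower(), path or "/", canonical_query, "")
    )
```

Normalizing before routing matters: if the same page can be reached under several spellings, each spelling would otherwise hash and dedup independently.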
Dedup Store: A combination of Bloom filter (fast approximate check) and on-disk LSM-tree (authoritative store) that tracks all URLs assigned to this machine. Incoming URLs are checked against both before being added to the frontier.
Gossip Agent: Participates in a protocol that disseminates hash ring membership, aggregate crawl progress statistics, and failure notifications across the fleet without centralized infrastructure.
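A push-style gossip round of the kind the Gossip Agent runs can be sketched as below. This is a toy model under simplifying assumptions: in-memory "peers" instead of network calls, heartbeat counters as the only state, and higher-counter-wins merging, which is the core of how membership views converge without a coordinator.

```python
import random

class GossipAgent:
    """Each machine keeps a view mapping node id -> latest heartbeat
    seen. Merging keeps the higher heartbeat, so newer information
    always wins and views converge."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.view = {node_id: 0}

    def tick(self):
        # Bump our own heartbeat each round; a stalled heartbeat is how
        # peers eventually declare this node dead.
        self.view[self.node_id] += 1

    def digest(self):
        return dict(self.view)

    def merge(self, remote_view):
        for node, heartbeat in remote_view.items():
            if heartbeat > self.view.get(node, -1):
                self.view[node] = heartbeat

def gossip_round(agents, fanout=2):
    """Every agent pushes its view to `fanout` random peers."""
    for a in agents:
        a.tick()
        for peer in random.sample([p for p in agents if p is not a], fanout):
            peer.merge(a.digest())
```

With fanout f, information spreads to the whole fleet in O(log N) rounds in expectation, which is what makes gossip viable at 10,000 machines where all-to-all exchange would not be.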