Meta's "Design Web Crawler" interview question focuses on architecting a scalable, distributed system for crawling websites efficiently. It is typically tagged with data engineering, distributed systems, web crawling, system design, backend, and infrastructure, with occasional ML ties for prioritization. No single canonical source provides Meta's exact problem statement verbatim, as the question evolves, but aggregated insights from interview prep sites and breakdowns reveal consistent details.
Problem Statement
Design a distributed web crawler that systematically discovers, fetches, parses, and stores web pages from a starting set of seed URLs. The system must handle billions of pages, respect politeness policies (e.g., robots.txt, rate limits), deduplicate content/URLs, prioritize crawling (e.g., via freshness or relevance scores), and support fault tolerance in a multi-node setup. Output processed page content and metadata for downstream uses like search indexing or ML training.[1][2][4]
Functional Requirements
- Accept seed URLs and crawl linked pages recursively up to a configurable depth.
- Fetch HTML, parse for new URLs and text/content, extract metadata (e.g., title, links).
- Store raw HTML, parsed text, and URL status (e.g., crawled, failed).
- Support crawling filters: allowed domains, file types (e.g., HTML only), max depth.[2][4][6]
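The fetch-and-parse step above can be sketched with the standard library's `html.parser` (standing in for BeautifulSoup, which the component breakdown mentions); the sample HTML and base URL are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the page title and absolute outbound links from raw HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL before
                # enqueueing them on the URL frontier
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

html = ('<html><head><title>Example</title></head>'
        '<body><a href="/about">About</a></body></html>')
parser = LinkExtractor("https://example.com")
parser.feed(html)
print(parser.title, parser.links)  # Example ['https://example.com/about']
```

A production parser would also extract body text and honor the domain/depth filters before emitting discovered URLs.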
Non-Functional Requirements & Constraints
- Scale: Crawl 10 billion pages in under 5 days (~23k pages/sec).
- Politeness: Delay between requests per domain (e.g., 1-5 sec), obey robots.txt/Crawl-delay.
- Storage: Billions of pages (~100KB of HTML each on average) → ~1PB for 10 billion pages; use distributed storage like S3/DynamoDB.
- Deduplication: Hash content/URLs to avoid recrawls; track visited set.
- Fault tolerance: Retries on failures, no data loss on node crashes.
- Freshness: Recrawl based on change frequency (e.g., news sites daily).
- Latency: Low end-to-end for high-priority URLs (<1s fetch).[4][8][1][2]
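The politeness requirement reduces to tracking, per domain, the earliest time the next request may go out. A minimal scheduler sketch (the fixed `delay_s` is an assumption; a real crawler would override it with each site's robots.txt `Crawl-delay`):

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Per-domain rate limiting: each call reserves the next fetch slot
    for the URL's domain and reports how long the caller must wait."""
    def __init__(self, delay_s=2.0):
        self.delay_s = delay_s
        self.next_allowed = {}  # domain -> earliest allowed unix timestamp

    def wait_time(self, url, now=None):
        """Seconds to wait before fetching url; 0.0 if allowed now."""
        now = time.time() if now is None else now
        domain = urlparse(url).netloc
        allowed_at = self.next_allowed.get(domain, 0.0)
        wait = max(0.0, allowed_at - now)
        # Reserve the next slot for this domain
        self.next_allowed[domain] = max(allowed_at, now) + self.delay_s
        return wait

sched = PolitenessScheduler(delay_s=2.0)
print(sched.wait_time("https://example.com/a", now=100.0))  # 0.0
print(sched.wait_time("https://example.com/b", now=100.5))  # 1.5
print(sched.wait_time("https://other.org/x", now=100.5))    # 0.0
```

Partitioning the URL frontier by domain (as described under High-Level Components) lets each worker keep this state locally instead of coordinating through a shared store.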
Input/Output Examples
There is no formal I/O specification, but the system's interfaces can be outlined as APIs:
Input (API/Frontier Queue):
POST /crawl/start { "seed_urls": ["https://example.com"], "max_depth": 3, "domains": ["example.com", "*.edu"], "rate_limit_per_domain": 10 } (rate limit in requests per minute)
Output (Stored Data/Events):
- Raw HTML → S3 bucket:
s3://crawler/raw/https://example.com/index.html
- Parsed Data → Database (e.g., DynamoDB):
| URL | Status | Content_Hash | Last_Crawled | Text_Excerpt | Links_Found |
|-----|--------|--------------|--------------|--------------|-------------|
| https://example.com | success | sha256:... | 2026-02-02 | "Welcome to..." | ["https://example.com/about"] | [1][6]
- Events (Pub/Sub):
{ "url": "...", "status": "crawled", "fetch_time": 1234567890 }[2]
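Building the metadata record and pub/sub event for one fetched page might look like the following sketch; the field names mirror the example schema above but are illustrative, not a confirmed Meta format:

```python
import hashlib
import json
import time

def make_crawl_event(url, html, status="crawled"):
    """Builds the per-page metadata/event record after a fetch."""
    # Content hash supports dedup: identical pages at different URLs
    # hash to the same value and need not be stored or parsed twice
    content_hash = "sha256:" + hashlib.sha256(html.encode()).hexdigest()
    return {
        "url": url,
        "status": status,
        "content_hash": content_hash,
        "fetch_time": int(time.time()),
    }

event = make_crawl_event("https://example.com", "<html>Welcome to...</html>")
print(json.dumps(event))
```

The same record would be written to the metadata store (DynamoDB/Cassandra) and published as an event for downstream consumers such as the indexer.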
High-Level Components
- URL Frontier: Priority queue (Kafka/Redis) for pending URLs, partitioned by domain.
- Fetchers: Distributed workers (thousands of nodes) using HTTP clients with proxies/rotating IPs.
- Parser/Extractor: Processes HTML (e.g., BeautifulSoup), emits new URLs/content.
- Storage: S3 for raw HTML, DynamoDB/Cassandra for metadata, Bloom filters for dedup.[6][4][2]
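The Bloom filter mentioned for dedup trades a small false-positive rate (an unseen URL occasionally reported as seen) for never missing a true duplicate and using a fixed, tiny amount of memory. A toy sketch with illustrative sizing (a real crawler would size `m` and `k` for billions of URLs and shard the filter):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: m bits, k hash functions derived from SHA-256."""
    def __init__(self, m=8 * 1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting the hash input
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com")
print("https://example.com" in bf)        # True (no false negatives)
print("https://example.com/about" in bf)  # almost certainly False
```

Because membership tests can be false positives but never false negatives, a "not seen" answer is trustworthy; a "seen" answer can optionally be confirmed against the authoritative visited set in the metadata store.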