Design a web crawler that takes seed URLs as input, discovers and crawls linked pages, extracts image URLs from web pages, and filters duplicate URLs, all while scaling to enterprise-level requirements. Think of simplified versions of Googlebot or enterprise site scanners that inventory content and assets across the web.
The crawler starts from a set of seed URLs, fetches each page, extracts links and image source URLs, and continues exploring within a defined scope. Users submit seed URLs, configure crawl parameters (depth limits, domain allow/deny lists, rate limits), and retrieve results mapping each crawled page to its discovered image URLs. The system must handle millions of pages, respect robots.txt politeness rules, avoid crawler traps and infinite loops, and produce incremental, queryable results as crawling progresses.
At ZoomInfo scale, interviewers want to see how you break a long-running workflow into resilient pipeline stages, apply backpressure, control concurrency at the right granularity (per domain), and make pragmatic storage and bandwidth trade-offs.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The frontier is the data structure that decides which URLs to fetch next. Naive approaches that dump all discovered URLs into a single queue collapse under scale and fail to enforce per-domain politeness.
Hints to consider:
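One concrete direction is a domain-partitioned frontier: a FIFO queue per domain, plus a heap that orders domains by the earliest time each may be fetched again. The sketch below is illustrative (class and method names are invented, and a real frontier would be a durable, distributed service), but it shows how partitioning makes per-domain politeness enforceable at scheduling time rather than at fetch time.

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Toy domain-partitioned frontier: one FIFO queue per domain,
    plus a heap ordering domains by when they may next be fetched.
    Names are illustrative, not a standard API."""

    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay      # politeness gap per domain, seconds
        self.queues = defaultdict(deque)    # domain -> pending URLs
        self.ready = []                     # heap of (next_fetch_time, domain)
        self.scheduled = set()              # domains currently in the heap

    def add(self, domain, url):
        self.queues[domain].append(url)
        if domain not in self.scheduled:
            heapq.heappush(self.ready, (time.monotonic(), domain))
            self.scheduled.add(domain)

    def next_url(self):
        """Return a URL whose domain may be fetched now, else None."""
        if not self.ready or self.ready[0][0] > time.monotonic():
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.queues[domain].popleft()
        if self.queues[domain]:
            # re-schedule the domain only after its crawl delay elapses
            heapq.heappush(self.ready,
                           (time.monotonic() + self.crawl_delay, domain))
        else:
            self.scheduled.discard(domain)
        return url
```

Note how a single global queue cannot express "this domain is cooling down"; partitioning by domain makes that a local property of each queue.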
Without aggressive deduplication, the crawler will revisit the same content endlessly. URL normalization is surprisingly tricky because the same page can appear under many different URLs.
Hints to consider:
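A normalization step along the lines described can be sketched with the standard library. This is a simplified version: production crawlers additionally resolve relative paths, strip known tracking parameters, and handle internationalized hostnames.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL so syntactic variants map to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # keep the port only when it is not the scheme's default
    port = parts.port
    if port and not (scheme == "http" and port == 80) \
            and not (scheme == "https" and port == 443):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # sort query parameters so ?a=1&b=2 and ?b=2&a=1 collide
    query = urlencode(sorted(parse_qsl(parts.query)))
    # fragments never reach the server, so discard them
    return urlunsplit((scheme, host, path, query, ""))
```

For example, `HTTP://Example.COM:80/a?b=2&a=1#frag` and `http://example.com/a?a=1&b=2` normalize to the same key, so the dedup layer treats them as one page.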
Crawling is a multi-step pipeline: schedule URL, fetch page, parse and extract content, persist results, and enqueue child URLs. Each stage can fail independently and needs its own retry and error handling strategy.
Hints to consider:
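The per-stage retry idea can be sketched as follows. The split into transient versus permanent errors is the key design point: a 5xx or timeout is worth retrying with backoff, while a 404 should go straight to a dead-letter record. Names here are invented for illustration; a real system would persist the attempt count on the queue message rather than looping in-process.

```python
import random
import time

class TransientError(Exception):
    """Fetch timeout, HTTP 5xx, connection reset: worth retrying."""

class PermanentError(Exception):
    """HTTP 404/410, malformed URL: retrying will not help."""

def run_stage(task, handler, dead_letter, max_attempts=4, base_delay=0.5):
    """Run one pipeline stage with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(task)
        except PermanentError:
            dead_letter.append(task)      # record and move on, no retry
            return None
        except TransientError:
            if attempt == max_attempts:
                dead_letter.append(task)  # retries exhausted
                return None
            # full jitter keeps retry storms from synchronizing
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Each stage (fetch, parse, persist, enqueue) gets its own `run_stage` policy, so a parser bug cannot exhaust the fetcher's retry budget and vice versa.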
A production crawler must respect website owners. Violating rate limits leads to IP bans, legal issues, and unreliable results. Interviewers probe whether you can enforce politeness in a distributed system where multiple workers might target the same domain.
Hints to consider:
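A common enforcement mechanism is a token bucket per domain. The in-process sketch below shows just the algorithm; in a distributed crawler the bucket state would live in a shared store (for example Redis, updated atomically) so every worker targeting the same domain sees the same budget.

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Token bucket per domain: each bucket refills at rate_per_sec
    and holds at most `burst` tokens; one token is spent per fetch."""

    def __init__(self, rate_per_sec=1.0, burst=1):
        self.rate = rate_per_sec
        self.burst = burst
        # domain -> (tokens remaining, last refill timestamp)
        self.state = defaultdict(lambda: (float(burst), time.monotonic()))

    def allow(self, domain):
        tokens, last = self.state[domain]
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[domain] = (tokens - 1.0, now)
            return True
        self.state[domain] = (tokens, now)
        return False
```

A fetcher calls `allow(domain)` before each request and requeues the URL (or sleeps) on a denial, so bursts against one host are bounded regardless of how many workers hold URLs for it.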
Confirm the expected scale: how many seed URLs, what is the target crawl size (millions or billions of pages), and what time frame is acceptable. Ask whether the system needs to extract only image URLs or also other assets like scripts and stylesheets. Clarify whether crawls are one-time jobs or recurring with incremental updates. Ask about depth limits, domain scope, and whether cross-domain crawling is in scope. Confirm the required freshness of results and whether users need real-time streaming results or periodic batch updates.
Sketch the core components: a Crawl Manager API for job submission, configuration, and progress queries; a URL Frontier service managing prioritized, partitioned queues of URLs to fetch; a fleet of Fetcher Workers that pull URLs from the frontier, download pages, and respect politeness rules; a Parser Service that extracts links and image URLs from raw HTML; a URL Store (DynamoDB or PostgreSQL) maintaining the canonical set of known URLs with deduplication; an Object Store (S3) for raw HTML content if archival is needed; and a Results Store mapping crawled pages to extracted image URLs. Show the data flow: seed URLs enter the frontier, fetchers pull and download pages, parsers extract links and images, new links are deduplicated and fed back into the frontier, and image URL mappings are written to the results store.
Walk through the full lifecycle of a URL from discovery to completion. A parser extracts a link from a crawled page. The link is first normalized (lowercase host, resolve path, strip fragments, sort query params). The normalized URL is checked against a Bloom filter for quick rejection of known URLs. If it passes, a conditional write to the URL store confirms it is truly new. The URL is then scored for priority and inserted into the appropriate domain-partitioned frontier queue. A fetcher worker pulls the URL when the domain's rate limit permits, fetches the page, and publishes the raw content for parsing. Discuss the Bloom filter's false-positive trade-off: a small false-positive rate (under 1 percent) is acceptable because it only causes occasional missed pages, while the memory savings are enormous compared to an exact set.
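The Bloom filter step above can be made concrete. Sized for n items at false-positive rate p, a Bloom filter needs about n·ln(p)/ln(2)² bits: roughly 1.2 GB for a billion URLs at 1 percent, versus tens of gigabytes for an exact set of URL strings. A minimal sketch:

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter illustrating the dedup trade-off:
    no false negatives, tunably rare false positives."""

    def __init__(self, n, p=0.01):
        self.m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # bits
        self.k = max(1, round(self.m / n * math.log(2)))         # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, url):
        # derive k positions from two 64-bit halves of one SHA-256 digest
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # avoid a zero stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        """False means definitely new; True means probably seen."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

A "definitely new" answer still goes through the conditional write for an exact check; a "probably seen" answer (which may be a rare false positive) skips the URL, which is the source of the occasionally missed pages mentioned above.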
Cover failure recovery: checkpointing frontier state to durable storage so crawls resume after scheduler crashes. Discuss DNS resolution caching to avoid hammering DNS servers. Address content-type filtering to skip non-HTML resources early. Cover monitoring: track pages crawled per second, frontier queue depth per domain, error rates by HTTP status code, and worker utilization. Discuss storage optimization: compress raw HTML before storing, use TTLs to expire stale crawl data, and partition results by crawl job for easy cleanup. Mention how to handle JavaScript-rendered pages if in scope, such as using headless browsers for a subset of pages that return minimal content without JavaScript execution.
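The compress-before-storing point is worth quantifying: HTML is highly redundant (repeated tags, attributes, and boilerplate), so gzip from the standard library routinely shrinks it severalfold before it hits the object store, and the round trip is lossless. A quick illustration with synthetic markup:

```python
import gzip

# Synthetic page with the kind of repetition real HTML exhibits.
html = ("<html><body>"
        + "<div class='item'><a href='/p'>link</a></div>" * 500
        + "</body></html>").encode("utf-8")

compressed = gzip.compress(html)
ratio = len(html) / len(compressed)

# Decompression round-trips exactly, so archived pages stay re-parsable.
assert gzip.decompress(compressed) == html
```

At millions of pages, even a modest 5x ratio translates directly into 80 percent less object-store spend and network transfer.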
Deepen your understanding of the patterns used in this problem: