Design a web crawler that takes seed URLs as input, discovers and crawls linked pages, extracts image URLs from web pages, and filters duplicate URLs, all while scaling to enterprise-level requirements. Think of simplified versions of Googlebot or enterprise site scanners that inventory content and assets across the web.
The crawler starts from a set of seed URLs, fetches each page, extracts links and image source URLs, and continues exploring within a defined scope. Users submit seed URLs, configure crawl parameters (depth limits, domain allow/deny lists, rate limits), and retrieve results mapping each crawled page to its discovered image URLs. The system must handle millions of pages, respect robots.txt politeness rules, avoid crawler traps and infinite loops, and produce incremental, queryable results as crawling progresses.
At ZoomInfo scale, interviewers want to see how you break a long-running workflow into resilient pipeline stages, apply backpressure, control concurrency at the right granularity (per domain), and make pragmatic storage and bandwidth trade-offs.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The frontier is the data structure that decides which URLs to fetch next. Naive approaches that dump all discovered URLs into a single queue collapse under scale and fail to enforce per-domain politeness.
Hints to consider:
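One concrete direction is a domain-partitioned frontier: a FIFO queue per domain, plus a heap that orders domains by the earliest time each may be fetched again. The sketch below is illustrative (class and method names are invented, and a real frontier would be a durable, distributed service), but it shows how partitioning makes per-domain politeness enforceable at scheduling time rather than at fetch time.

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Toy domain-partitioned frontier: one FIFO queue per domain,
    plus a heap ordering domains by when they may next be fetched.
    Names are illustrative, not a standard API."""

    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay      # politeness gap per domain, seconds
        self.queues = defaultdict(deque)    # domain -> pending URLs
        self.ready = []                     # heap of (next_fetch_time, domain)
        self.scheduled = set()              # domains currently in the heap

    def add(self, domain, url):
        self.queues[domain].append(url)
        if domain not in self.scheduled:
            heapq.heappush(self.ready, (time.monotonic(), domain))
            self.scheduled.add(domain)

    def next_url(self):
        """Return a URL whose domain may be fetched now, else None."""
        if not self.ready or self.ready[0][0] > time.monotonic():
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.queues[domain].popleft()
        if self.queues[domain]:
            # re-schedule the domain only after its crawl delay elapses
            heapq.heappush(self.ready,
                           (time.monotonic() + self.crawl_delay, domain))
        else:
            self.scheduled.discard(domain)
        return url
```

Note how a single global queue cannot express "this domain is cooling down"; partitioning by domain makes that a local property of each queue.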
Without aggressive deduplication, the crawler will revisit the same content endlessly. URL normalization is surprisingly tricky because the same page can appear under many different URLs.
Hints to consider:
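A normalization step along the lines described can be sketched with the standard library. This is a simplified version: production crawlers additionally resolve relative paths, strip known tracking parameters, and handle internationalized hostnames.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL so syntactic variants map to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # keep the port only when it is not the scheme's default
    port = parts.port
    if port and not (scheme == "http" and port == 80) \
            and not (scheme == "https" and port == 443):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # sort query parameters so ?a=1&b=2 and ?b=2&a=1 collide
    query = urlencode(sorted(parse_qsl(parts.query)))
    # fragments never reach the server, so discard them
    return urlunsplit((scheme, host, path, query, ""))
```

For example, `HTTP://Example.COM:80/a?b=2&a=1#frag` and `http://example.com/a?a=1&b=2` normalize to the same key, so the dedup layer treats them as one page.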
Crawling is a multi-step pipeline: schedule URL, fetch page, parse and extract content, persist results, and enqueue child URLs. Each stage can fail independently and needs its own retry and error handling strategy.
Hints to consider:
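The per-stage retry idea can be sketched as follows. The split into transient versus permanent errors is the key design point: a 5xx or timeout is worth retrying with backoff, while a 404 should go straight to a dead-letter record. Names here are invented for illustration; a real system would persist the attempt count on the queue message rather than looping in-process.

```python
import random
import time

class TransientError(Exception):
    """Fetch timeout, HTTP 5xx, connection reset: worth retrying."""

class PermanentError(Exception):
    """HTTP 404/410, malformed URL: retrying will not help."""

def run_stage(task, handler, dead_letter, max_attempts=4, base_delay=0.5):
    """Run one pipeline stage with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(task)
        except PermanentError:
            dead_letter.append(task)      # record and move on, no retry
            return None
        except TransientError:
            if attempt == max_attempts:
                dead_letter.append(task)  # retries exhausted
                return None
            # full jitter keeps retry storms from synchronizing
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Each stage (fetch, parse, persist, enqueue) gets its own `run_stage` policy, so a parser bug cannot exhaust the fetcher's retry budget and vice versa.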
A production crawler must respect website owners. Violating rate limits leads to IP bans, legal issues, and unreliable results. Interviewers probe whether you can enforce politeness in a distributed system where multiple workers might target the same domain.
Hints to consider:
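A common enforcement mechanism is a token bucket per domain. The in-process sketch below shows just the algorithm; in a distributed crawler the bucket state would live in a shared store (for example Redis, updated atomically) so every worker targeting the same domain sees the same budget.

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Token bucket per domain: each bucket refills at rate_per_sec
    and holds at most `burst` tokens; one token is spent per fetch."""

    def __init__(self, rate_per_sec=1.0, burst=1):
        self.rate = rate_per_sec
        self.burst = burst
        # domain -> (tokens remaining, last refill timestamp)
        self.state = defaultdict(lambda: (float(burst), time.monotonic()))

    def allow(self, domain):
        tokens, last = self.state[domain]
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[domain] = (tokens - 1.0, now)
            return True
        self.state[domain] = (tokens, now)
        return False
```

A fetcher calls `allow(domain)` before each request and requeues the URL (or sleeps) on a denial, so bursts against one host are bounded regardless of how many workers hold URLs for it.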
Confirm the expected scale: how many seed URLs, what is the target crawl size (millions or billions of pages), and what time frame is acceptable. Ask whether the system needs to extract only image URLs or also other assets like scripts and stylesheets. Clarify whether crawls are one-time jobs or recurring with incremental updates. Ask about depth limits, domain scope, and whether cross-domain crawling is in scope. Confirm the required freshness of results and whether users need real-time streaming results or periodic batch updates.
Sketch the core components: a Crawl Manager API for job submission, configuration, and progress queries; a URL Frontier service managing prioritized, partitioned queues of URLs to fetch; a fleet of Fetcher Workers that pull URLs from the frontier, download pages, and respect politeness rules; a Parser Service that extracts links and image URLs from raw HTML; a URL Store (DynamoDB or PostgreSQL) maintaining the canonical set of known URLs with deduplication; an Object Store (S3) for raw HTML content if archival is needed; and a Results Store mapping crawled pages to extracted image URLs. Show the data flow: seed URLs enter the frontier, fetchers pull and download pages, parsers extract links and images, new links are deduplicated and fed back into the frontier, and image URL mappings are written to the results store.
Walk through the full lifecycle of a URL from discovery to completion. A parser extracts a link from a crawled page. The link is first normalized (lowercase host, resolve path, strip fragments, sort query params). The normalized URL is checked against a Bloom filter for quick rejection of known URLs. If it passes, a conditional write to the URL store confirms it is truly new. The URL is then scored for priority and inserted into the appropriate domain-partitioned frontier queue. A fetcher worker pulls the URL when the domain's rate limit permits, fetches the page, and publishes the raw content for parsing. Discuss the Bloom filter's false-positive trade-off: a small false-positive rate (under 1 percent) is acceptable because it only causes occasional missed pages, while the memory savings are enormous compared to an exact set.
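The Bloom filter step above can be made concrete. Sized for n items at false-positive rate p, a Bloom filter needs about n·ln(p)/ln(2)² bits: roughly 1.2 GB for a billion URLs at 1 percent, versus tens of gigabytes for an exact set of URL strings. A minimal sketch:

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter illustrating the dedup trade-off:
    no false negatives, tunably rare false positives."""

    def __init__(self, n, p=0.01):
        self.m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # bits
        self.k = max(1, round(self.m / n * math.log(2)))         # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, url):
        # derive k positions from two 64-bit halves of one SHA-256 digest
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # avoid a zero stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        """False means definitely new; True means probably seen."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

A "definitely new" answer still goes through the conditional write for an exact check; a "probably seen" answer (which may be a rare false positive) skips the URL, which is the source of the occasionally missed pages mentioned above.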
Cover failure recovery: checkpointing frontier state to durable storage so crawls resume after scheduler crashes. Discuss DNS resolution caching to avoid hammering DNS servers. Address content-type filtering to skip non-HTML resources early. Cover monitoring: track pages crawled per second, frontier queue depth per domain, error rates by HTTP status code, and worker utilization. Discuss storage optimization: compress raw HTML before storing, use TTLs to expire stale crawl data, and partition results by crawl job for easy cleanup. Mention how to handle JavaScript-rendered pages if in scope, such as using headless browsers for a subset of pages that return minimal content without JavaScript execution.
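The compress-before-storing point is worth quantifying: HTML is highly redundant (repeated tags, attributes, and boilerplate), so gzip from the standard library routinely shrinks it severalfold before it hits the object store, and the round trip is lossless. A quick illustration with synthetic markup:

```python
import gzip

# Synthetic page with the kind of repetition real HTML exhibits.
html = ("<html><body>"
        + "<div class='item'><a href='/p'>link</a></div>" * 500
        + "</body></html>").encode("utf-8")

compressed = gzip.compress(html)
ratio = len(html) / len(compressed)

# Decompression round-trips exactly, so archived pages stay re-parsable.
assert gzip.decompress(compressed) == html
```

At millions of pages, even a modest 5x ratio translates directly into 80 percent less object-store spend and network transfer.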
Deepen your understanding of the patterns used in this problem: