Design a Web Crawler
System Design · Must
Problem Statement
Design a web crawler that takes seed URLs as input, discovers and crawls linked pages, extracts image URLs from each page, and handles duplicate URL filtering while scaling to enterprise-level requirements. Users should be able to submit crawl jobs, define scope and politeness constraints, and query the status and results of their crawls through an API.
This is one of the most frequently asked system design questions at Atlassian. Interviewers frame it as a mix of job scheduling and web crawling: users submit URLs to a frontier queue, receive a job ID, and can query status including the number of URLs parsed. The challenge is orchestrating a high-throughput, fault-tolerant pipeline that respects robots.txt and per-domain rate limits, deduplicates URLs aggressively, avoids crawler traps, and produces incremental results. Expect to demonstrate how you break a long-running workflow into resilient stages, control concurrency at the per-domain level, and make pragmatic storage and bandwidth tradeoffs.
Key Requirements
Functional
- Seed URL submission -- users submit a list of seed URLs to start a crawl and receive a job ID for tracking
- Image URL extraction -- for each crawled page, store the list of discovered image URLs (not the images themselves)
- Scope and politeness -- support configurable max depth, domain allow/deny lists, and per-domain rate limits that respect robots.txt
- Progress monitoring -- users can query crawl progress, retrieve partial results, and see how many URLs have been processed
Non-Functional
- Scalability -- handle crawls spanning millions of pages across thousands of domains with horizontally scalable workers
- Reliability -- survive worker crashes and restarts without losing progress or duplicating work
- Politeness -- enforce per-domain concurrency limits and request intervals to avoid overwhelming target sites
- Latency -- minimize time from crawl submission to first results appearing; surface partial results as they become available
What Interviewers Focus On
Based on real interview experiences at Atlassian, these are the areas interviewers probe most deeply:
1. Crawl Frontier and URL Scheduling
The frontier is the heart of the crawler. Interviewers want to see how you manage the queue of URLs to visit, prioritize them, and prevent unbounded growth.
Hints to consider:
- Use Kafka topics partitioned by domain to maintain per-domain ordering and concurrency control
- Implement a priority queue within each partition based on depth, page importance, or freshness
- Pre-check URLs against a Bloom filter or Redis set before adding to the frontier to avoid duplicate scheduling
- Support backpressure so the frontier does not grow unbounded when workers cannot keep up
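The hints above can be sketched in-process. This is a minimal illustration, not a production design: the per-domain heaps stand in for Kafka partitions keyed by domain, the plain `set` stands in for a Bloom filter backed by an authoritative store, and the size cap models backpressure. All class and method names here are hypothetical.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

class Frontier:
    """Per-domain priority frontier sketch. A real deployment would shard
    this across Kafka partitions keyed by domain; a heap per domain stands
    in here so the scheduling logic is visible in one place."""

    def __init__(self, max_size=100_000):
        self.queues = defaultdict(list)   # domain -> min-heap of (depth, url)
        self.seen = set()                 # stand-in for Bloom filter + Redis set
        self.size = 0
        self.max_size = max_size          # backpressure threshold

    def add(self, url, depth):
        if url in self.seen:
            return False                  # duplicate: already scheduled once
        if self.size >= self.max_size:
            return False                  # backpressure: reject until workers drain
        self.seen.add(url)
        domain = urlparse(url).netloc
        heapq.heappush(self.queues[domain], (depth, url))
        self.size += 1
        return True

    def next_for_domain(self, domain):
        """Pop the shallowest (highest-priority) URL for one domain."""
        if self.queues[domain]:
            self.size -= 1
            return heapq.heappop(self.queues[domain])[1]
        return None
```

Prioritizing by depth keeps the crawl breadth-first per domain; the tuple key could just as easily encode page importance or freshness.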
2. URL Deduplication and Canonicalization
Without robust deduplication, crawlers enter infinite loops and waste resources. This is a frequent probe area in Atlassian interviews.
Hints to consider:
- Canonicalize URLs by normalizing scheme, removing trailing slashes, sorting query parameters, and resolving relative paths
- Use content hashing (MD5 or SHA-256 of the page body) to detect identical pages served at different URLs; fuzzy fingerprints such as SimHash extend this to near-duplicates
- Store seen URLs in a distributed set (Redis or DynamoDB with conditional writes) for fast membership checks
- Use a Bloom filter for an efficient first-pass check, falling back to the authoritative store for positives
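A canonicalization routine along these lines covers the normalizations listed above. This is a sketch of one reasonable set of rules (lowercase scheme/host, drop default ports and fragments, strip trailing slashes, sort query parameters), plus a content fingerprint for exact-duplicate detection; real crawlers tune these rules per site.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Normalize a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = parts.path.rstrip("/") or "/"          # strip trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query params
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped

def content_fingerprint(body: bytes) -> str:
    """Hash the page body to catch identical content at distinct URLs."""
    return hashlib.sha256(body).hexdigest()
```

The canonical form, not the raw URL, is what goes into the Bloom filter and the authoritative seen-set.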
3. Politeness and Rate Limiting
Crawling too aggressively gets IP addresses blocked and violates site policies. Interviewers expect a concrete, distributed politeness mechanism.
Hints to consider:
- Fetch and cache robots.txt for each domain in Redis with a reasonable TTL (24 hours)
- Implement per-domain token bucket rate limiting using Redis atomic scripts, shared across all workers
- Enforce a minimum crawl delay between consecutive requests to the same domain (typically 1-5 seconds)
- Use separate retry queues with exponential backoff for failed fetches to avoid hammering unhealthy sites
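The token-bucket idea can be shown in a few lines. This sketch is in-process for readability; in the distributed version described above, the same refill-and-decrement logic would run as an atomic Redis Lua script so every worker shares one bucket per domain. The class name and injected clock are illustrative choices.

```python
import time

class DomainRateLimiter:
    """Per-domain token bucket sketch. Distributed deployments would run
    this logic atomically in Redis (e.g. a Lua script); here it is local."""

    def __init__(self, rate_per_sec=1.0, burst=2, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # bucket capacity
        self.clock = clock            # injectable for testing
        self.buckets = {}             # domain -> (tokens, last_refill_ts)

    def try_acquire(self, domain):
        """Return True if a request to this domain may proceed now."""
        now = self.clock()
        tokens, last = self.buckets.get(domain, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[domain] = (tokens - 1.0, now)
            return True
        self.buckets[domain] = (tokens, now)
        return False
```

A worker that fails `try_acquire` requeues the URL rather than sleeping, so fetch threads stay busy on other domains.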
4. Fault Tolerance and Job Management
Crawls run for hours or days. Interviewers assess how you handle partial failures and restarts, and how you provide visibility into progress.
Hints to consider:
- Checkpoint crawl state (frontier position, processed URL count, extracted results) to durable storage periodically
- Use at-least-once processing with idempotent writes so reprocessing a URL does not create duplicate results
- Implement a dead-letter queue for URLs that repeatedly fail after retry exhaustion
- Expose a job status API that reads from the checkpoint store, showing URLs discovered, processed, and failed
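A small sketch ties the idempotency and status points together. Results are keyed by canonical URL, so an at-least-once worker replaying a message overwrites the same row instead of duplicating it, and the status endpoint simply counts what the store holds. The in-memory dict is a stand-in for a durable store such as DynamoDB or Postgres; all names here are hypothetical.

```python
class JobStore:
    """Checkpoint-store sketch for one crawl job. At-least-once workers
    write results keyed by URL, making replays idempotent; job status is
    derived from stored state rather than tracked separately."""

    def __init__(self):
        self.results = {}   # canonical URL -> list of extracted image URLs
        self.failed = set() # URLs sent to the dead-letter queue

    def record_result(self, url, image_urls):
        # Idempotent upsert: replaying the same message changes nothing.
        self.results[url] = image_urls

    def record_failure(self, url):
        self.failed.add(url)

    def status(self):
        """What the job status API would return for this job."""
        return {"processed": len(self.results), "failed": len(self.failed)}
```

Because counts are derived, a crashed worker that restarts and reprocesses its last batch cannot inflate the progress numbers users see.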