Design a Web Crawler
System Design · Must
Problem Statement
Design a web crawler that takes seed URLs as input, discovers and crawls linked pages, extracts image URLs from each page, and handles duplicate URL filtering while scaling to enterprise-level requirements. Users should be able to submit crawl jobs, define scope and politeness constraints, and query the status and results of their crawls through an API.
This is one of the most frequently asked system design questions at Atlassian. Interviewers frame it as a mix of job scheduling and web crawling: users submit URLs to a frontier queue, receive a job ID, and can query status including the number of URLs parsed. The challenge is orchestrating a high-throughput, fault-tolerant pipeline that respects robots.txt and per-domain rate limits, deduplicates URLs aggressively, avoids crawler traps, and produces incremental results. Expect to demonstrate how you break a long-running workflow into resilient stages, control concurrency at the per-domain level, and make pragmatic storage and bandwidth tradeoffs.
Key Requirements
Functional
- Seed URL submission -- users submit a list of seed URLs to start a crawl and receive a job ID for tracking
- Image URL extraction -- for each crawled page, store the list of discovered image URLs (not the images themselves)
- Scope and politeness -- support configurable max depth, domain allow/deny lists, and per-domain rate limits that respect robots.txt
- Progress monitoring -- users can query crawl progress, retrieve partial results, and see how many URLs have been processed
Non-Functional
- Scalability -- handle crawls spanning millions of pages across thousands of domains with horizontally scalable workers
- Reliability -- survive worker crashes and restarts without losing progress or duplicating work
- Politeness -- enforce per-domain concurrency limits and request intervals to avoid overwhelming target sites
- Latency -- minimize time from crawl submission to first results appearing; surface partial results as they become available
What Interviewers Focus On
Based on real interview experiences at Atlassian, these are the areas interviewers probe most deeply:
1. Crawl Frontier and URL Scheduling
The frontier is the heart of the crawler. Interviewers want to see how you manage the queue of URLs to visit, prioritize them, and prevent unbounded growth.
Hints to consider:
- Use Kafka topics partitioned by domain to maintain per-domain ordering and concurrency control
- Implement a priority queue within each partition based on depth, page importance, or freshness
- Pre-check URLs against a Bloom filter or Redis set before adding to the frontier to avoid duplicate scheduling
- Support backpressure so the frontier does not grow unbounded when workers cannot keep up
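The hints above can be sketched in-process. This is a minimal illustration, not a production design: the per-domain heaps stand in for Kafka partitions keyed by domain, the plain `set` stands in for a Bloom filter backed by an authoritative store, and the size cap models backpressure. All class and method names here are hypothetical.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

class Frontier:
    """Per-domain priority frontier sketch. A real deployment would shard
    this across Kafka partitions keyed by domain; a heap per domain stands
    in here so the scheduling logic is visible in one place."""

    def __init__(self, max_size=100_000):
        self.queues = defaultdict(list)   # domain -> min-heap of (depth, url)
        self.seen = set()                 # stand-in for Bloom filter + Redis set
        self.size = 0
        self.max_size = max_size          # backpressure threshold

    def add(self, url, depth):
        if url in self.seen:
            return False                  # duplicate: already scheduled once
        if self.size >= self.max_size:
            return False                  # backpressure: reject until workers drain
        self.seen.add(url)
        domain = urlparse(url).netloc
        heapq.heappush(self.queues[domain], (depth, url))
        self.size += 1
        return True

    def next_for_domain(self, domain):
        """Pop the shallowest (highest-priority) URL for one domain."""
        if self.queues[domain]:
            self.size -= 1
            return heapq.heappop(self.queues[domain])[1]
        return None
```

Prioritizing by depth keeps the crawl breadth-first per domain; the tuple key could just as easily encode page importance or freshness.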
2. URL Deduplication and Canonicalization
Without robust deduplication, crawlers enter infinite loops and waste resources. This is a frequent probe area in Atlassian interviews.
Hints to consider:
- Canonicalize URLs by normalizing scheme, removing trailing slashes, sorting query parameters, and resolving relative paths
- Use content hashing (MD5 or SHA-256 of the page body) to detect identical pages served at different URLs; fuzzy fingerprints such as SimHash extend this to near-duplicates
- Store seen URLs in a distributed set (Redis or DynamoDB with conditional writes) for fast membership checks
- Use a Bloom filter for an efficient first-pass check, falling back to the authoritative store for positives
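A canonicalization routine along these lines covers the normalizations listed above. This is a sketch of one reasonable set of rules (lowercase scheme/host, drop default ports and fragments, strip trailing slashes, sort query parameters), plus a content fingerprint for exact-duplicate detection; real crawlers tune these rules per site.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Normalize a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = parts.path.rstrip("/") or "/"          # strip trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query params
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped

def content_fingerprint(body: bytes) -> str:
    """Hash the page body to catch identical content at distinct URLs."""
    return hashlib.sha256(body).hexdigest()
```

The canonical form, not the raw URL, is what goes into the Bloom filter and the authoritative seen-set.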
3. Politeness and Rate Limiting
Crawling too aggressively gets IP addresses blocked and violates site policies. Interviewers expect a concrete, distributed politeness mechanism.
Hints to consider:
- Fetch and cache robots.txt for each domain in Redis with a reasonable TTL (24 hours)
- Implement per-domain token bucket rate limiting using Redis atomic scripts, shared across all workers
- Enforce a minimum crawl delay between consecutive requests to the same domain (typically 1-5 seconds)
- Use separate retry queues with exponential backoff for failed fetches to avoid hammering unhealthy sites
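The token-bucket idea can be shown in a few lines. This sketch is in-process for readability; in the distributed version described above, the same refill-and-decrement logic would run as an atomic Redis Lua script so every worker shares one bucket per domain. The class name and injected clock are illustrative choices.

```python
import time

class DomainRateLimiter:
    """Per-domain token bucket sketch. Distributed deployments would run
    this logic atomically in Redis (e.g. a Lua script); here it is local."""

    def __init__(self, rate_per_sec=1.0, burst=2, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # bucket capacity
        self.clock = clock            # injectable for testing
        self.buckets = {}             # domain -> (tokens, last_refill_ts)

    def try_acquire(self, domain):
        """Return True if a request to this domain may proceed now."""
        now = self.clock()
        tokens, last = self.buckets.get(domain, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[domain] = (tokens - 1.0, now)
            return True
        self.buckets[domain] = (tokens, now)
        return False
```

A worker that fails `try_acquire` requeues the URL rather than sleeping, so fetch threads stay busy on other domains.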
4. Fault Tolerance and Job Management
Crawls run for hours or days. Interviewers assess how you handle partial failures and restarts, and how you provide visibility into progress.
Hints to consider:
- Checkpoint crawl state (frontier position, processed URL count, extracted results) to durable storage periodically
- Use at-least-once processing with idempotent writes so reprocessing a URL does not create duplicate results
- Implement a dead-letter queue for URLs that repeatedly fail after retry exhaustion
- Expose a job status API that reads from the checkpoint store, showing URLs discovered, processed, and failed
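A small sketch ties the idempotency and status points together. Results are keyed by canonical URL, so an at-least-once worker replaying a message overwrites the same row instead of duplicating it, and the status endpoint simply counts what the store holds. The in-memory dict is a stand-in for a durable store such as DynamoDB or Postgres; all names here are hypothetical.

```python
class JobStore:
    """Checkpoint-store sketch for one crawl job. At-least-once workers
    write results keyed by URL, making replays idempotent; job status is
    derived from stored state rather than tracked separately."""

    def __init__(self):
        self.results = {}   # canonical URL -> list of extracted image URLs
        self.failed = set() # URLs sent to the dead-letter queue

    def record_result(self, url, image_urls):
        # Idempotent upsert: replaying the same message changes nothing.
        self.results[url] = image_urls

    def record_failure(self, url):
        self.failed.add(url)

    def status(self):
        """What the job status API would return for this job."""
        return {"processed": len(self.results), "failed": len(self.failed)}
```

Because counts are derived, a crashed worker that restarts and reprocesses its last batch cannot inflate the progress numbers users see.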