Anthropic's Web Crawler - Concurrent Version is a coding interview problem focused on implementing a multithreaded web crawler that performs BFS traversal within a single domain. It appears primarily on platforms like Hello Interview and is tagged with BFS, concurrency, and threading/asyncio.[1]
Problem Statement
Build a web crawler starting from a seed URL that uses breadth-first search (BFS) to discover and fetch all reachable pages within the same domain. Implement concurrent fetching with multiple threads (typically 4-5 workers) while ensuring thread safety for shared structures like the URL queue and visited set. The crawler should skip external domains—for example, from https://example.com it visits https://example.com/about but ignores https://other.com—and avoid revisiting URLs.[2][1]
Key Requirements
BFS order: Process URLs level-by-level using a queue.
Domain restriction: Only crawl the same domain as the start URL (normalize with urllib.parse for consistency).
Concurrency: Use ThreadPoolExecutor or threading with 4-5 workers for I/O-bound fetching.
Thread safety: Protect visited set with locks; queue.Queue is inherently thread-safe.
Termination: Signal workers when queue empties and no active crawling remains.
Extensions: Often discuss asyncio for efficiency or GIL impacts vs. multiprocessing.[1][2]
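The requirements above can be combined into a minimal threaded sketch. This is one plausible implementation, not a canonical solution: the `fetch_links(url)` callable is a hypothetical injection point (a real crawler would wrap `requests` plus an HTML parser there), and the returned visited set is unordered, so strict BFS output ordering would need extra bookkeeping.

```python
import threading
from queue import Queue
from urllib.parse import urlparse

def crawl(start_url, fetch_links, num_workers=4):
    """BFS-crawl all same-domain URLs reachable from start_url.

    fetch_links(url) -> list of absolute URLs found on that page
    (assumed injectable here; a real crawler fetches and parses HTML).
    """
    domain = urlparse(start_url).netloc
    q = Queue()                      # thread-safe by design
    visited = {start_url}
    lock = threading.Lock()          # guards the visited set
    q.put(start_url)

    def worker():
        while True:
            url = q.get()
            if url is None:          # sentinel: shut this worker down
                q.task_done()
                return
            try:
                for link in fetch_links(url):
                    if urlparse(link).netloc != domain:
                        continue     # skip external domains
                    with lock:       # atomic check-and-add
                        if link in visited:
                            continue
                        visited.add(link)
                    q.put(link)
            finally:
                q.task_done()        # always account for the dequeued task

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()                         # blocks until every enqueued URL is processed
    for _ in threads:
        q.put(None)                  # one sentinel per worker releases them
    for t in threads:
        t.join()
    return visited
```

The sentinel pattern answers the termination requirement: `q.join()` only returns once every discovered URL has been fully processed (workers enqueue new links before calling `task_done()`, so the pending count can never hit zero early).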
Constraints
No strict numerical limits are specified, but typical guidelines include:
Crawl only same domain.
BFS traversal.
Thread-safe shared queue/visited set.
Concurrent URL fetching (avoid same URL twice).
Handle normalization (e.g., www. vs non-www, trailing slashes).
Practical bounds: Consider max depth/queue size to prevent unbounded growth; timeouts for fetches.[1]
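The normalization constraint can be handled with a small canonicalizer built on urllib.parse. The exact policy below (strip a leading `www.`, lowercase scheme and host, drop fragments, trim trailing slashes) is an assumption; interviewers mainly want the check-and-add key to be stable:

```python
from urllib.parse import urlparse, urlunparse

def normalize(url):
    """Canonicalize a URL so near-duplicates map to one visited-set key.

    Hypothetical policy: lowercase scheme/host, strip a leading 'www.',
    drop fragments, and remove trailing slashes from the path.
    """
    p = urlparse(url)
    host = p.netloc.lower()
    if host.startswith("www."):
        host = host[4:]              # treat www.example.com as example.com
    path = p.path.rstrip("/")        # /about/ and /about are the same page
    # urlunparse takes (scheme, netloc, path, params, query, fragment)
    return urlunparse((p.scheme.lower(), host, path, "", p.query, ""))
```

Running every discovered link through `normalize` before the visited-set check prevents crawling the same page twice under different spellings.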
Input/Output Examples
Detailed examples are sparse across sources, but a conceptual walkthrough illustrates the flow:
Workers fetch concurrently, parse links (e.g., via BeautifulSoup), add same-domain unvisited URLs to queue.
Visited set prevents duplicates (locked check-add).
Output: List of all visited same-domain URLs in BFS order, e.g., ["https://example.com", "https://example.com/about", "https://example.com/contact"].
No formal test cases found; emphasis is on code implementation and edge cases like race conditions or network errors.[1]
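The "locked check-add" mentioned above is the core race-condition defense, and it can be isolated into a tiny idiom. A sketch (the `try_claim` name is made up for illustration):

```python
import threading

visited = set()
visited_lock = threading.Lock()

def try_claim(url):
    """Atomic test-and-add: returns True only for the first caller.

    Doing 'if url in visited' and 'visited.add(url)' as two separate
    unlocked steps is a race: two workers can both pass the membership
    check and fetch the same URL twice. Holding the lock across both
    steps makes the claim atomic.
    """
    with visited_lock:
        if url in visited:
            return False
        visited.add(url)
        return True
```

Each worker calls `try_claim(link)` and only enqueues the URL when it returns True, which is exactly the duplicate-prevention behavior described in the walkthrough.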
This matches LeetCode 1242 (Web Crawler Multithreaded) in spirit but emphasizes production aspects like domain filtering and Python concurrency primitives over exact HTML parsing APIs.[6][1]
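Since the extensions discussion usually turns to asyncio, here is one way the same BFS could look as a single-threaded event loop. This is a sketch under the assumption that `fetch_links` is an async callable (wrapping aiohttp in a real crawler); note that no lock is needed because the check-and-add runs without an `await` in between:

```python
import asyncio
from urllib.parse import urlparse

async def crawl_async(start_url, fetch_links, num_workers=4):
    """asyncio variant of the concurrent BFS crawl.

    fetch_links is assumed to be an async callable returning a list of
    absolute URLs. Cooperative scheduling means the visited-set
    check-and-add below is atomic without a lock: there is no await
    between the membership test and the add.
    """
    domain = urlparse(start_url).netloc
    queue = asyncio.Queue()
    visited = {start_url}
    await queue.put(start_url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if urlparse(link).netloc == domain and link not in visited:
                        visited.add(link)
                        await queue.put(link)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()        # every enqueued URL fully processed
    for w in workers:
        w.cancel()            # workers are idle on queue.get(); cancel them
    await asyncio.gather(*workers, return_exceptions=True)
    return visited
```

Compared with the threaded version, this avoids GIL contention entirely for I/O-bound fetching, which is the usual point of the asyncio follow-up.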