Anthropic's Web Crawler - Concurrent Version is a coding interview problem focused on implementing a multithreaded web crawler that performs BFS traversal within a single domain. It appears primarily on platforms like Hello Interview and is tagged with BFS, concurrency, and threading/asyncio.[1]
Problem Statement
Build a web crawler starting from a seed URL that uses breadth-first search (BFS) to discover and fetch all reachable pages within the same domain. Implement concurrent fetching with multiple threads (typically 4-5 workers) while ensuring thread safety for shared structures like the URL queue and visited set. The crawler should skip external domains—for example, from https://example.com it visits https://example.com/about but ignores https://other.com—and avoid revisiting URLs.[2][1]
Key Requirements
BFS order: Process URLs level-by-level using a queue.
Domain restriction: Only crawl the same domain as the start URL (normalize with urllib.parse for consistency).
Concurrency: Use ThreadPoolExecutor or threading with 4-5 workers for I/O-bound fetching.
Thread safety: Protect visited set with locks; queue.Queue is inherently thread-safe.
Termination: Signal workers when queue empties and no active crawling remains.
Extensions: Often discuss asyncio for efficiency or GIL impacts vs. multiprocessing.[1][2]
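The requirements above can be combined into a minimal threaded sketch. This is one plausible implementation, not a canonical solution: the `fetch_links(url)` callable is a hypothetical injection point (a real crawler would wrap `requests` plus an HTML parser there), and the returned visited set is unordered, so strict BFS output ordering would need extra bookkeeping.

```python
import threading
from queue import Queue
from urllib.parse import urlparse

def crawl(start_url, fetch_links, num_workers=4):
    """BFS-crawl all same-domain URLs reachable from start_url.

    fetch_links(url) -> list of absolute URLs found on that page
    (assumed injectable here; a real crawler fetches and parses HTML).
    """
    domain = urlparse(start_url).netloc
    q = Queue()                      # thread-safe by design
    visited = {start_url}
    lock = threading.Lock()          # guards the visited set
    q.put(start_url)

    def worker():
        while True:
            url = q.get()
            if url is None:          # sentinel: shut this worker down
                q.task_done()
                return
            try:
                for link in fetch_links(url):
                    if urlparse(link).netloc != domain:
                        continue     # skip external domains
                    with lock:       # atomic check-and-add
                        if link in visited:
                            continue
                        visited.add(link)
                    q.put(link)
            finally:
                q.task_done()        # always account for the dequeued task

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()                         # blocks until every enqueued URL is processed
    for _ in threads:
        q.put(None)                  # one sentinel per worker releases them
    for t in threads:
        t.join()
    return visited
```

The sentinel pattern answers the termination requirement: `q.join()` only returns once every discovered URL has been fully processed (workers enqueue new links before calling `task_done()`, so the pending count can never hit zero early).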
Constraints
No strict numerical limits are specified, but typical guidelines include:
Crawl only same domain.
BFS traversal.
Thread-safe shared queue/visited set.
Concurrent URL fetching (avoid same URL twice).
Handle normalization (e.g., www. vs non-www, trailing slashes).
Practical bounds: Consider max depth/queue size to prevent unbounded growth; timeouts for fetches.[1]
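The normalization constraint can be handled with a small canonicalizer built on urllib.parse. The exact policy below (strip a leading `www.`, lowercase scheme and host, drop fragments, trim trailing slashes) is an assumption; interviewers mainly want the check-and-add key to be stable:

```python
from urllib.parse import urlparse, urlunparse

def normalize(url):
    """Canonicalize a URL so near-duplicates map to one visited-set key.

    Hypothetical policy: lowercase scheme/host, strip a leading 'www.',
    drop fragments, and remove trailing slashes from the path.
    """
    p = urlparse(url)
    host = p.netloc.lower()
    if host.startswith("www."):
        host = host[4:]              # treat www.example.com as example.com
    path = p.path.rstrip("/")        # /about/ and /about are the same page
    # urlunparse takes (scheme, netloc, path, params, query, fragment)
    return urlunparse((p.scheme.lower(), host, path, "", p.query, ""))
```

Running every discovered link through `normalize` before the visited-set check prevents crawling the same page twice under different spellings.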
Input/Output Examples
Detailed examples are sparse across sources, but a conceptual walkthrough illustrates the flow:
Workers fetch concurrently, parse links (e.g., via BeautifulSoup), add same-domain unvisited URLs to queue.
Visited set prevents duplicates (locked check-add).
Output: List of all visited same-domain URLs in BFS order, e.g., ["https://example.com", "https://example.com/about", "https://example.com/contact"].
No formal test cases found; emphasis is on code implementation and edge cases like race conditions or network errors.[1]
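The "locked check-add" mentioned above is the core race-condition defense, and it can be isolated into a tiny idiom. A sketch (the `try_claim` name is made up for illustration):

```python
import threading

visited = set()
visited_lock = threading.Lock()

def try_claim(url):
    """Atomic test-and-add: returns True only for the first caller.

    Doing 'if url in visited' and 'visited.add(url)' as two separate
    unlocked steps is a race: two workers can both pass the membership
    check and fetch the same URL twice. Holding the lock across both
    steps makes the claim atomic.
    """
    with visited_lock:
        if url in visited:
            return False
        visited.add(url)
        return True
```

Each worker calls `try_claim(link)` and only enqueues the URL when it returns True, which is exactly the duplicate-prevention behavior described in the walkthrough.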
This matches LeetCode 1242 (Web Crawler Multithreaded) in spirit but emphasizes production aspects like domain filtering and Python concurrency primitives over exact HTML parsing APIs.[6][1]
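Since the extensions discussion usually turns to asyncio, here is one way the same BFS could look as a single-threaded event loop. This is a sketch under the assumption that `fetch_links` is an async callable (wrapping aiohttp in a real crawler); note that no lock is needed because the check-and-add runs without an `await` in between:

```python
import asyncio
from urllib.parse import urlparse

async def crawl_async(start_url, fetch_links, num_workers=4):
    """asyncio variant of the concurrent BFS crawl.

    fetch_links is assumed to be an async callable returning a list of
    absolute URLs. Cooperative scheduling means the visited-set
    check-and-add below is atomic without a lock: there is no await
    between the membership test and the add.
    """
    domain = urlparse(start_url).netloc
    queue = asyncio.Queue()
    visited = {start_url}
    await queue.put(start_url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if urlparse(link).netloc == domain and link not in visited:
                        visited.add(link)
                        await queue.put(link)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()        # every enqueued URL fully processed
    for w in workers:
        w.cancel()            # workers are idle on queue.get(); cancel them
    await asyncio.gather(*workers, return_exceptions=True)
    return visited
```

Compared with the threaded version, this avoids GIL contention entirely for I/O-bound fetching, which is the usual point of the asyncio follow-up.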