Practice/OpenAI/Multithreaded Web Crawler
Design and implement a parallel web crawler that explores all pages within a single domain. Given a starting URL, your crawler should discover and visit all reachable pages that belong to the same domain, using multiple threads to improve performance.
The crawler should use breadth-first traversal to explore the web graph. When it visits a page, it discovers the URLs that page links to. It should only follow links that stay within the starting URL's domain; links to any other domain must be filtered out.
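The domain check can be done with the standard library's URL parser. A minimal sketch (the helper name `same_domain` is illustrative, not part of the problem):

```python
from urllib.parse import urlsplit

def same_domain(url: str, start_url: str) -> bool:
    # urlsplit("http://example.com/home").netloc == "example.com",
    # so comparing netloc values filters out external domains.
    return urlsplit(url).netloc == urlsplit(start_url).netloc
```

For instance, `same_domain("http://other.com/external", "http://site.com/page1")` is false, so that link would be skipped.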
To simulate web page fetching without actual network calls, you'll receive a graph structure that maps each URL to its list of outgoing links. Your implementation must handle concurrent access to shared data structures safely to avoid race conditions.
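The core race to avoid is two threads both deciding a URL is unvisited and fetching it twice. One common pattern is an atomic check-and-add guarded by a lock; this is a sketch, and the `VisitedSet` / `add_if_new` names are illustrative:

```python
import threading

class VisitedSet:
    """Set of URLs with an atomic check-and-add, safe for many threads."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url: str) -> bool:
        # Holding the lock makes the membership test and the insert one
        # atomic step, so exactly one caller wins for each URL.
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True
```

A worker then only enqueues a URL when `add_if_new` returns true, which also breaks cycles in the graph.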
Example 1:
```
Input:
start_url = "http://example.com/home"
page_graph = {
    "http://example.com/home": ["http://example.com/about"],
    "http://example.com/about": []
}
max_workers = 2

Output: ["http://example.com/about", "http://example.com/home"]
```

Explanation: Starting from /home, we discover /about. Both are in the same domain. The result is sorted alphabetically.
Example 2:
```
Input:
start_url = "http://site.com/page1"
page_graph = {
    "http://site.com/page1": ["http://site.com/page2", "http://other.com/external"],
    "http://site.com/page2": ["http://site.com/page1"],
    "http://other.com/external": []
}
max_workers = 2

Output: ["http://site.com/page1", "http://site.com/page2"]
```

Explanation: The external link to other.com is filtered out. The cycle between page1 and page2 is handled by the visited set.
Example 3:
```
Input:
start_url = "http://blog.io/a"
page_graph = {
    "http://blog.io/a": ["http://blog.io/b", "http://blog.io/c"],
    "http://blog.io/b": ["http://blog.io/d"],
    "http://blog.io/c": ["http://blog.io/d"],
    "http://blog.io/d": []
}
max_workers = 3

Output: ["http://blog.io/a", "http://blog.io/b", "http://blog.io/c", "http://blog.io/d"]
```

Explanation: All four pages are discovered through BFS traversal. Page d is found via two paths but visited only once.
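Putting the pieces together, one possible level-synchronous solution fetches each BFS frontier in parallel with a thread pool, while a lock protects the shared visited set. This is a sketch under the assumed signature `crawl(start_url, page_graph, max_workers)`; the simulated `fetch` simply looks links up in the graph:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

def crawl(start_url, page_graph, max_workers=2):
    domain = urlsplit(start_url).netloc
    visited = {start_url}
    lock = threading.Lock()

    def worker(url):
        # Simulated fetch: outgoing links come from the provided graph.
        newly_found = []
        for link in page_graph.get(url, []):
            if urlsplit(link).netloc != domain:
                continue  # external domain: filter out
            with lock:
                # Atomic check-and-add so two workers never both claim a URL.
                if link not in visited:
                    visited.add(link)
                    newly_found.append(link)
        return newly_found

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # Process one BFS level at a time, fetching pages in parallel.
            frontier = [link
                        for links in pool.map(worker, frontier)
                        for link in links]
    return sorted(visited)

# Example 2 from the problem statement:
graph = {
    "http://site.com/page1": ["http://site.com/page2", "http://other.com/external"],
    "http://site.com/page2": ["http://site.com/page1"],
    "http://other.com/external": [],
}
print(crawl("http://site.com/page1", graph))
# ['http://site.com/page1', 'http://site.com/page2']
```

The level-synchronous design keeps the frontier handoff in the main thread, so only the visited set needs locking; swapping `pool.map` for a task queue that workers feed directly would remove the per-level barrier at the cost of more synchronization.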