Practice/OpenAI/Multithreaded Web Crawler
Design and implement a parallel web crawler that explores all pages within a single domain. Given a starting URL, your crawler should discover and visit all reachable pages that belong to the same domain, using multiple threads to improve performance.
The crawler should use breadth-first traversal to explore the web graph. When it visits a page, it discovers the URLs that page links to. It should only follow links that stay within the starting URL's domain; links to any other domain must be filtered out.
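The domain check can be done with the standard library's URL parser. A minimal sketch (the helper name `same_domain` is illustrative, not part of the problem):

```python
from urllib.parse import urlsplit

def same_domain(url: str, start_url: str) -> bool:
    # urlsplit("http://example.com/home").netloc == "example.com",
    # so comparing netloc values filters out external domains.
    return urlsplit(url).netloc == urlsplit(start_url).netloc
```

For instance, `same_domain("http://other.com/external", "http://site.com/page1")` is false, so that link would be skipped.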
To simulate web page fetching without actual network calls, you'll receive a graph structure that maps each URL to its list of outgoing links. Your implementation must handle concurrent access to shared data structures safely to avoid race conditions.
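The core race to avoid is two threads both deciding a URL is unvisited and fetching it twice. One common pattern is an atomic check-and-add guarded by a lock; this is a sketch, and the `VisitedSet` / `add_if_new` names are illustrative:

```python
import threading

class VisitedSet:
    """Set of URLs with an atomic check-and-add, safe for many threads."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url: str) -> bool:
        # Holding the lock makes the membership test and the insert one
        # atomic step, so exactly one caller wins for each URL.
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True
```

A worker then only enqueues a URL when `add_if_new` returns true, which also breaks cycles in the graph.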
Example 1:
```
Input:
start_url = "http://example.com/home"
page_graph = {
    "http://example.com/home": ["http://example.com/about"],
    "http://example.com/about": []
}
max_workers = 2

Output: ["http://example.com/about", "http://example.com/home"]
```

Explanation: Starting from /home, we discover /about. Both are in the same domain. The result is sorted alphabetically.
Example 2:
```
Input:
start_url = "http://site.com/page1"
page_graph = {
    "http://site.com/page1": ["http://site.com/page2", "http://other.com/external"],
    "http://site.com/page2": ["http://site.com/page1"],
    "http://other.com/external": []
}
max_workers = 2

Output: ["http://site.com/page1", "http://site.com/page2"]
```

Explanation: The external link to other.com is filtered out. The cycle between page1 and page2 is handled by the visited set.
Example 3:
```
Input:
start_url = "http://blog.io/a"
page_graph = {
    "http://blog.io/a": ["http://blog.io/b", "http://blog.io/c"],
    "http://blog.io/b": ["http://blog.io/d"],
    "http://blog.io/c": ["http://blog.io/d"],
    "http://blog.io/d": []
}
max_workers = 3

Output: ["http://blog.io/a", "http://blog.io/b", "http://blog.io/c", "http://blog.io/d"]
```

Explanation: All four pages are discovered through BFS traversal. Page d is found via two paths but visited only once.
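Putting the pieces together, one possible level-synchronous solution fetches each BFS frontier in parallel with a thread pool, while a lock protects the shared visited set. This is a sketch under the assumed signature `crawl(start_url, page_graph, max_workers)`; the simulated `fetch` simply looks links up in the graph:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

def crawl(start_url, page_graph, max_workers=2):
    domain = urlsplit(start_url).netloc
    visited = {start_url}
    lock = threading.Lock()

    def worker(url):
        # Simulated fetch: outgoing links come from the provided graph.
        newly_found = []
        for link in page_graph.get(url, []):
            if urlsplit(link).netloc != domain:
                continue  # external domain: filter out
            with lock:
                # Atomic check-and-add so two workers never both claim a URL.
                if link not in visited:
                    visited.add(link)
                    newly_found.append(link)
        return newly_found

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # Process one BFS level at a time, fetching pages in parallel.
            frontier = [link
                        for links in pool.map(worker, frontier)
                        for link in links]
    return sorted(visited)

# Example 2 from the problem statement:
graph = {
    "http://site.com/page1": ["http://site.com/page2", "http://other.com/external"],
    "http://site.com/page2": ["http://site.com/page1"],
    "http://other.com/external": [],
}
print(crawl("http://site.com/page1", graph))
# ['http://site.com/page1', 'http://site.com/page2']
```

The level-synchronous design keeps the frontier handoff in the main thread, so only the visited set needs locking; swapping `pool.map` for a task queue that workers feed directly would remove the per-level barrier at the cost of more synchronization.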