Build a multithreaded web crawler that explores pages within a single hostname. Given a starting URL and an HTML parser interface, your crawler should discover and return all unique URLs that belong to the same hostname as the starting URL.
The HtmlParser interface provides a method getUrls(url) that returns a list of all URLs found on the page at the given URL. Your implementation should:

- Start from startUrl and use htmlParser.getUrls to discover links.
- Crawl only URLs whose hostname exactly matches the hostname of startUrl.
- Return each discovered URL exactly once, in any order.

Hostname matching rules:

- The hostname is the part of the URL between the protocol and the first `/` (e.g., "shop.example.com" in "http://shop.example.com/products").
- Subdomains count as different hostnames: "news.company.net" does not match "company.net".
- Comparison is case-insensitive.

The getUrls method is thread-safe and can be called concurrently.

Example 1:
```
Input: startUrl = "http://shop.example.com/products"
htmlParser.getUrls("http://shop.example.com/products") = ["http://shop.example.com/cart", "http://shop.example.com/about"]
htmlParser.getUrls("http://shop.example.com/cart") = []
htmlParser.getUrls("http://shop.example.com/about") = ["http://shop.example.com/products"]
Output: ["http://shop.example.com/products", "http://shop.example.com/cart", "http://shop.example.com/about"]
Explanation: All three URLs share the hostname "shop.example.com" and are reachable from the start URL. The cyclic link back to products is handled correctly.
```
Example 2:
```
Input: startUrl = "http://company.net"
htmlParser.getUrls("http://company.net") = ["http://company.net/jobs", "http://news.company.net", "http://partner.com"]
htmlParser.getUrls("http://company.net/jobs") = []
Output: ["http://company.net", "http://company.net/jobs"]
Explanation: "http://news.company.net" has a different hostname (subdomain) and "http://partner.com" is an entirely different domain, so both are excluded.
```
Hint 1: Thread Safety Use a thread-safe set to track visited URLs and a lock to synchronize access when checking and adding new URLs. This prevents race conditions where multiple threads might try to visit the same URL simultaneously.
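A minimal sketch of that check-and-add pattern (the function and variable names here are illustrative, not part of the problem's API):

```python
from threading import Lock

visited = set()
visited_lock = Lock()

def try_claim(url: str) -> bool:
    """Atomically check whether url is new and, if so, mark it visited."""
    with visited_lock:
        if url in visited:
            return False  # some other thread already claimed this URL
        visited.add(url)
        return True

print(try_claim("http://shop.example.com/cart"))  # True: first claim wins
print(try_claim("http://shop.example.com/cart"))  # False: duplicate rejected
```

Because the membership check and the add happen under a single lock acquisition, two threads that discover the same URL at the same time cannot both claim it.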
Hint 2: Hostname Extraction Parse URLs to extract the hostname component. Python's `urllib.parse` module provides utilities for this. Remember that hostnames should be compared case-insensitively and don't include the protocol or port.
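For example, the `hostname` attribute returned by `urlparse` is already lowercased and excludes both the protocol and any port:

```python
from urllib.parse import urlparse

def get_hostname(url: str) -> str:
    # .hostname is lowercase and drops the scheme and any port
    return urlparse(url).hostname or ""

print(get_hostname("http://Shop.Example.com:8080/products"))  # shop.example.com
print(get_hostname("http://news.company.net") == get_hostname("http://company.net"))  # False
```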
Hint 3: Thread Coordination Consider using a queue or a set of threads that need to be joined. Each thread should process one URL, extract its links, spawn new threads for unvisited URLs with matching hostnames, and wait for child threads to complete.
Full Solution

Approach:

Hostname Extraction: Parse the starting URL to extract its hostname for filtering. Use `urlparse` to reliably extract the hostname component and normalize it to lowercase for case-insensitive comparison.

Thread-Safe Visited Set: Maintain a set of visited URLs protected by a lock. Before crawling any URL, check whether it has already been visited using thread-safe operations (acquire lock, check, add, release lock).
Recursive Parallel Crawling: For each URL, create a thread that:
- Fetches all links from the current page
- Filters links to keep only those with matching hostnames
- Spawns new threads for unvisited URLs
- Waits for all child threads to complete (join)
Thread Coordination: Each thread joins all its child threads before completing. This ensures the main thread waits until the entire crawl is complete before returning results.
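This join discipline can be seen in isolation: a parent thread starts its children and returns only after every one of them has finished (a toy example, not the crawler itself):

```python
from threading import Thread

results = []

def child(n: int):
    results.append(n * n)

def parent():
    threads = [Thread(target=child, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # parent proceeds only after all children finish

parent()
print(sorted(results))  # [0, 1, 4, 9]
```

Applied recursively, this guarantees the root call does not return until the entire crawl tree has been explored.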
Time Complexity: O(N) where N is the number of unique URLs, as each URL is visited exactly once.
Space Complexity: O(N) for the visited set plus O(N) for thread stack space in the worst case of a deep crawl tree.
Thread Safety Considerations: The critical section is when checking and adding URLs to the visited set. The lock ensures that even if multiple threads discover the same URL simultaneously, only one will proceed to crawl it.
```python
from typing import List
from threading import Thread, Lock
from urllib.parse import urlparse


class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        """Returns all URLs found on the given page."""
        pass


class Solution:
    def crawl(self, startUrl: str, htmlParser: HtmlParser) -> List[str]:
        # Extract the hostname from a URL for comparison
        def get_hostname(url: str) -> str:
            hostname = urlparse(url).hostname
            return hostname.lower() if hostname else ""

        target_hostname = get_hostname(startUrl)

        # Thread-safe set to track visited URLs
        visited = set()
        visited_lock = Lock()

        def crawl_url(url: str):
            """Crawls a single URL and spawns threads for unvisited links."""
            # Get all links from the current page
            urls = htmlParser.getUrls(url)

            # Threads spawned for child URLs
            threads = []
            for next_url in urls:
                # Keep only URLs with the same hostname
                if get_hostname(next_url) != target_hostname:
                    continue

                # Thread-safe check-and-add to the visited set
                with visited_lock:
                    if next_url in visited:
                        continue
                    visited.add(next_url)

                # Spawn a new thread to crawl this URL
                thread = Thread(target=crawl_url, args=(next_url,))
                thread.start()
                threads.append(thread)

            # Wait for all child threads to complete
            for thread in threads:
                thread.join()

        # Mark the starting URL as visited, then crawl from it
        visited.add(startUrl)
        crawl_url(startUrl)

        # Return all discovered URLs
        return list(visited)
```
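The thread-per-URL design above can also be phrased as a level-by-level BFS over a thread pool, which caps the number of live threads. Below is a sketch under the same HtmlParser assumption; `crawl_pooled`, `MockHtmlParser`, and the link graph are illustrative names, not part of the problem statement:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import List
from urllib.parse import urlparse


def crawl_pooled(startUrl: str, htmlParser, max_workers: int = 8) -> List[str]:
    # .hostname is already lowercase, so comparisons are case-insensitive
    target = urlparse(startUrl).hostname
    visited = {startUrl}
    lock = Lock()

    def worker(url: str) -> List[str]:
        """Fetch one page, claim its unvisited same-host links, return them."""
        claimed = []
        for nxt in htmlParser.getUrls(url):
            if urlparse(nxt).hostname != target:
                continue
            with lock:
                if nxt in visited:
                    continue
                visited.add(nxt)
            claimed.append(nxt)
        return claimed

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [startUrl]
        while frontier:
            # Process the current frontier in parallel; collect the next one
            results = pool.map(worker, frontier)
            frontier = [u for children in results for u in children]
    return list(visited)


class MockHtmlParser:
    """In-memory stub standing in for the real HtmlParser interface."""

    def __init__(self, graph):
        self.graph = graph

    def getUrls(self, url: str) -> List[str]:
        return self.graph.get(url, [])


# Reproduces Example 2 without any network access
graph = {
    "http://company.net": ["http://company.net/jobs", "http://news.company.net", "http://partner.com"],
    "http://company.net/jobs": [],
}
found = crawl_pooled("http://company.net", MockHtmlParser(graph))
print(sorted(found))  # ['http://company.net', 'http://company.net/jobs']
```

The frontier-based loop trades the unbounded thread fan-out of the recursive version for a fixed-size pool, at the cost of synchronizing between BFS levels.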