Practice/Anthropic/Leetcode 1242. Web Crawler Multithreaded
Build a multithreaded web crawler that explores pages within a single hostname. Given a starting URL and an HTML parser interface, your crawler should discover and return all unique URLs that belong to the same hostname as the starting URL.
The HtmlParser interface provides a method getUrls(url) that returns a list of all URLs found on the page at the given URL. Your implementation should:

- Use multiple threads to crawl pages in parallel
- Start from startUrl and discover new pages via htmlParser.getUrls(url)
- Never crawl the same URL twice
- Return only URLs whose hostname exactly matches the hostname of startUrl

Hostname matching rules:

- Hostnames must match exactly; a subdomain (e.g. news.company.net) is a different hostname from its parent domain (company.net)
- All URLs use the http protocol, and you may assume no port is specified

The getUrls method is thread-safe and can be called concurrently.

Example 1:
```
Input:
startUrl = "http://shop.example.com/products"
htmlParser.getUrls("http://shop.example.com/products") = ["http://shop.example.com/cart", "http://shop.example.com/about"]
htmlParser.getUrls("http://shop.example.com/cart") = []
htmlParser.getUrls("http://shop.example.com/about") = ["http://shop.example.com/products"]

Output: ["http://shop.example.com/products", "http://shop.example.com/cart", "http://shop.example.com/about"]

Explanation: All three URLs share the hostname "shop.example.com" and are reachable from the start URL. The cyclic link back to products is handled correctly.
```
Example 2:
```
Input:
startUrl = "http://company.net"
htmlParser.getUrls("http://company.net") = ["http://company.net/jobs", "http://news.company.net", "http://partner.com"]
htmlParser.getUrls("http://company.net/jobs") = []

Output: ["http://company.net", "http://company.net/jobs"]

Explanation: "http://news.company.net" has a different hostname (subdomain) and "http://partner.com" is an entirely different domain, so both are excluded.
```
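One way to approach this is a level-by-level parallel BFS: crawl the current frontier with a thread pool, filter discovered URLs by hostname, and guard the visited set with a lock so no URL is crawled twice. The sketch below assumes Python with `concurrent.futures`; the helper `get_hostname`, the `MockParser` class, and the `max_workers` parameter are illustrative additions, not part of the problem interface (only `getUrls` is).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def get_hostname(url):
    # "http://shop.example.com/products" -> "shop.example.com"
    # (relies on the problem's guarantee of http URLs without ports)
    return url.split("/")[2]

def crawl(start_url, html_parser, max_workers=8):
    hostname = get_hostname(start_url)
    visited = {start_url}
    lock = threading.Lock()  # protects `visited` across worker threads

    def worker(url):
        # getUrls is stated to be thread-safe, so workers may call it concurrently
        discovered = []
        for next_url in html_parser.getUrls(url):
            if get_hostname(next_url) != hostname:
                continue  # exact hostname match only; subdomains are excluded
            with lock:
                if next_url in visited:
                    continue  # another thread already claimed this URL
                visited.add(next_url)
            discovered.append(next_url)
        return discovered

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # crawl the whole frontier in parallel, then gather the next wave
            results = pool.map(worker, frontier)
            frontier = [u for urls in results for u in urls]
    return list(visited)

# Hypothetical stand-in for the judge's HtmlParser, backed by a dict
class MockParser:
    def __init__(self, graph):
        self.graph = graph

    def getUrls(self, url):
        return self.graph.get(url, [])
```

Because each URL is added to `visited` before it enters the next frontier, the cycle in Example 1 (about linking back to products) terminates naturally, and duplicate work is avoided even when two threads discover the same link at once.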