Practice/Anthropic/Leetcode 1242. Web Crawler Multithreaded
Build a multithreaded web crawler that explores pages within a single hostname. Given a starting URL and an HTML parser interface, your crawler should discover and return all unique URLs that belong to the same hostname as the starting URL.
The HtmlParser interface provides a method getUrls(url) that returns a list of all URLs found on the page at the given URL. Your implementation should:

- Use multiple threads to crawl pages in parallel
- Start from startUrl and discover new pages via htmlParser.getUrls(url)
- Never crawl the same URL twice
- Return only URLs whose hostname exactly matches the hostname of startUrl

Hostname matching rules:

- Hostnames must match exactly; a subdomain (e.g. news.company.net) is a different hostname from its parent domain (company.net)
- All URLs use the http protocol, and you may assume no port is specified

The getUrls method is thread-safe and can be called concurrently.

Example 1:
```
Input:
startUrl = "http://shop.example.com/products"
htmlParser.getUrls("http://shop.example.com/products") = ["http://shop.example.com/cart", "http://shop.example.com/about"]
htmlParser.getUrls("http://shop.example.com/cart") = []
htmlParser.getUrls("http://shop.example.com/about") = ["http://shop.example.com/products"]

Output: ["http://shop.example.com/products", "http://shop.example.com/cart", "http://shop.example.com/about"]

Explanation: All three URLs share the hostname "shop.example.com" and are reachable from the start URL. The cyclic link back to products is handled correctly.
```
Example 2:
```
Input:
startUrl = "http://company.net"
htmlParser.getUrls("http://company.net") = ["http://company.net/jobs", "http://news.company.net", "http://partner.com"]
htmlParser.getUrls("http://company.net/jobs") = []

Output: ["http://company.net", "http://company.net/jobs"]

Explanation: "http://news.company.net" has a different hostname (subdomain) and "http://partner.com" is an entirely different domain, so both are excluded.
```
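One way to approach this is a level-by-level parallel BFS: crawl the current frontier with a thread pool, filter discovered URLs by hostname, and guard the visited set with a lock so no URL is crawled twice. The sketch below assumes Python with `concurrent.futures`; the helper `get_hostname`, the `MockParser` class, and the `max_workers` parameter are illustrative additions, not part of the problem interface (only `getUrls` is).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def get_hostname(url):
    # "http://shop.example.com/products" -> "shop.example.com"
    # (relies on the problem's guarantee of http URLs without ports)
    return url.split("/")[2]

def crawl(start_url, html_parser, max_workers=8):
    hostname = get_hostname(start_url)
    visited = {start_url}
    lock = threading.Lock()  # protects `visited` across worker threads

    def worker(url):
        # getUrls is stated to be thread-safe, so workers may call it concurrently
        discovered = []
        for next_url in html_parser.getUrls(url):
            if get_hostname(next_url) != hostname:
                continue  # exact hostname match only; subdomains are excluded
            with lock:
                if next_url in visited:
                    continue  # another thread already claimed this URL
                visited.add(next_url)
            discovered.append(next_url)
        return discovered

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # crawl the whole frontier in parallel, then gather the next wave
            results = pool.map(worker, frontier)
            frontier = [u for urls in results for u in urls]
    return list(visited)

# Hypothetical stand-in for the judge's HtmlParser, backed by a dict
class MockParser:
    def __init__(self, graph):
        self.graph = graph

    def getUrls(self, url):
        return self.graph.get(url, [])
```

Because each URL is added to `visited` before it enters the next frontier, the cycle in Example 1 (about linking back to products) terminates naturally, and duplicate work is avoided even when two threads discover the same link at once.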