Build a multithreaded web crawler that explores pages within a single hostname. Given a starting URL and an HTML parser interface, your crawler should discover and return all unique URLs that belong to the same hostname as the starting URL.
The HtmlParser interface provides a method getUrls(url) that returns a list of all URLs found on the page at the given URL. Your implementation should:

- Start from startUrl and use htmlParser.getUrls to discover links.
- Crawl only URLs whose hostname exactly matches the hostname of startUrl.
- Return each discovered URL exactly once, in any order.

Hostname matching rules:

- The hostname is the part of the URL between the protocol and the first `/` (e.g., "shop.example.com" in "http://shop.example.com/products").
- Subdomains count as different hostnames: "news.company.net" does not match "company.net".
- Comparison is case-insensitive.

The getUrls method is thread-safe and can be called concurrently.

Example 1:
```
Input: startUrl = "http://shop.example.com/products"
htmlParser.getUrls("http://shop.example.com/products") = ["http://shop.example.com/cart", "http://shop.example.com/about"]
htmlParser.getUrls("http://shop.example.com/cart") = []
htmlParser.getUrls("http://shop.example.com/about") = ["http://shop.example.com/products"]
Output: ["http://shop.example.com/products", "http://shop.example.com/cart", "http://shop.example.com/about"]
Explanation: All three URLs share the hostname "shop.example.com" and are reachable from the start URL. The cyclic link back to products is handled correctly.
```
Example 2:
```
Input: startUrl = "http://company.net"
htmlParser.getUrls("http://company.net") = ["http://company.net/jobs", "http://news.company.net", "http://partner.com"]
htmlParser.getUrls("http://company.net/jobs") = []
Output: ["http://company.net", "http://company.net/jobs"]
Explanation: "http://news.company.net" has a different hostname (subdomain) and "http://partner.com" is an entirely different domain, so both are excluded.
```
Hint 1: Thread Safety Use a thread-safe set to track visited URLs and a lock to synchronize access when checking and adding new URLs. This prevents race conditions where multiple threads might try to visit the same URL simultaneously.
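A minimal sketch of that check-and-add pattern (the function and variable names here are illustrative, not part of the problem's API):

```python
from threading import Lock

visited = set()
visited_lock = Lock()

def try_claim(url: str) -> bool:
    """Atomically check whether url is new and, if so, mark it visited."""
    with visited_lock:
        if url in visited:
            return False  # some other thread already claimed this URL
        visited.add(url)
        return True

print(try_claim("http://shop.example.com/cart"))  # True: first claim wins
print(try_claim("http://shop.example.com/cart"))  # False: duplicate rejected
```

Because the membership check and the add happen under a single lock acquisition, two threads that discover the same URL at the same time cannot both claim it.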
Hint 2: Hostname Extraction Parse URLs to extract the hostname component. Python's `urllib.parse` module provides utilities for this. Remember that hostnames should be compared case-insensitively and don't include the protocol or port.
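For example, the `hostname` attribute returned by `urlparse` is already lowercased and excludes both the protocol and any port:

```python
from urllib.parse import urlparse

def get_hostname(url: str) -> str:
    # .hostname is lowercase and drops the scheme and any port
    return urlparse(url).hostname or ""

print(get_hostname("http://Shop.Example.com:8080/products"))  # shop.example.com
print(get_hostname("http://news.company.net") == get_hostname("http://company.net"))  # False
```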
Hint 3: Thread Coordination Consider using a queue or a set of threads that need to be joined. Each thread should process one URL, extract its links, spawn new threads for unvisited URLs with matching hostnames, and wait for child threads to complete.
Full Solution

Approach:

Hostname Extraction: Parse the starting URL to extract its hostname for filtering. Use `urlparse` to reliably extract the hostname component and normalize it to lowercase for case-insensitive comparison.

Thread-Safe Visited Set: Maintain a set of visited URLs protected by a lock. Before crawling any URL, check whether it has already been visited using thread-safe operations (acquire lock, check, add, release lock).
Recursive Parallel Crawling: For each URL, create a thread that:
- Fetches all links from the current page
- Filters links to keep only those with matching hostnames
- Spawns new threads for unvisited URLs
- Waits for all child threads to complete (join)
Thread Coordination: Each thread joins all its child threads before completing. This ensures the main thread waits until the entire crawl is complete before returning results.
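This join discipline can be seen in isolation: a parent thread starts its children and returns only after every one of them has finished (a toy example, not the crawler itself):

```python
from threading import Thread

results = []

def child(n: int):
    results.append(n * n)

def parent():
    threads = [Thread(target=child, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # parent proceeds only after all children finish

parent()
print(sorted(results))  # [0, 1, 4, 9]
```

Applied recursively, this guarantees the root call does not return until the entire crawl tree has been explored.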
Time Complexity: O(N) where N is the number of unique URLs, as each URL is visited exactly once.
Space Complexity: O(N) for the visited set plus O(N) for thread stack space in the worst case of a deep crawl tree.
Thread Safety Considerations: The critical section is when checking and adding URLs to the visited set. The lock ensures that even if multiple threads discover the same URL simultaneously, only one will proceed to crawl it.
```python
from typing import List
from threading import Thread, Lock
from urllib.parse import urlparse


class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        """Returns all URLs found on the given page."""
        pass


class Solution:
    def crawl(self, startUrl: str, htmlParser: HtmlParser) -> List[str]:
        # Extract the hostname from a URL for comparison
        def get_hostname(url: str) -> str:
            hostname = urlparse(url).hostname
            return hostname.lower() if hostname else ""

        target_hostname = get_hostname(startUrl)

        # Thread-safe set to track visited URLs
        visited = set()
        visited_lock = Lock()

        def crawl_url(url: str):
            """Crawls a single URL and spawns threads for unvisited links."""
            # Get all links from the current page
            urls = htmlParser.getUrls(url)

            # Threads spawned for child URLs
            threads = []
            for next_url in urls:
                # Keep only URLs with the same hostname
                if get_hostname(next_url) != target_hostname:
                    continue

                # Thread-safe check-and-add to the visited set
                with visited_lock:
                    if next_url in visited:
                        continue
                    visited.add(next_url)

                # Spawn a new thread to crawl this URL
                thread = Thread(target=crawl_url, args=(next_url,))
                thread.start()
                threads.append(thread)

            # Wait for all child threads to complete
            for thread in threads:
                thread.join()

        # Mark the starting URL as visited, then crawl from it
        visited.add(startUrl)
        crawl_url(startUrl)

        # Return all discovered URLs
        return list(visited)
```
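The thread-per-URL design above can also be phrased as a level-by-level BFS over a thread pool, which caps the number of live threads. Below is a sketch under the same HtmlParser assumption; `crawl_pooled`, `MockHtmlParser`, and the link graph are illustrative names, not part of the problem statement:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import List
from urllib.parse import urlparse


def crawl_pooled(startUrl: str, htmlParser, max_workers: int = 8) -> List[str]:
    # .hostname is already lowercase, so comparisons are case-insensitive
    target = urlparse(startUrl).hostname
    visited = {startUrl}
    lock = Lock()

    def worker(url: str) -> List[str]:
        """Fetch one page, claim its unvisited same-host links, return them."""
        claimed = []
        for nxt in htmlParser.getUrls(url):
            if urlparse(nxt).hostname != target:
                continue
            with lock:
                if nxt in visited:
                    continue
                visited.add(nxt)
            claimed.append(nxt)
        return claimed

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [startUrl]
        while frontier:
            # Process the current frontier in parallel; collect the next one
            results = pool.map(worker, frontier)
            frontier = [u for children in results for u in children]
    return list(visited)


class MockHtmlParser:
    """In-memory stub standing in for the real HtmlParser interface."""

    def __init__(self, graph):
        self.graph = graph

    def getUrls(self, url: str) -> List[str]:
        return self.graph.get(url, [])


# Reproduces Example 2 without any network access
graph = {
    "http://company.net": ["http://company.net/jobs", "http://news.company.net", "http://partner.com"],
    "http://company.net/jobs": [],
}
found = crawl_pooled("http://company.net", MockHtmlParser(graph))
print(sorted(found))  # ['http://company.net', 'http://company.net/jobs']
```

The frontier-based loop trades the unbounded thread fan-out of the recursive version for a fixed-size pool, at the cost of synchronizing between BFS levels.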