You are building a web crawler that navigates through web pages starting from a given URL. The crawler should only visit pages that belong to the same hostname as the starting URL.
You are given:
- `startUrl` — the URL to start crawling from
- an `HtmlParser` interface that provides a method `getUrls(url)`, which returns a list of all URLs found on the page at the given URL

Your task is to implement a function that crawls all pages starting from `startUrl` and returns a list of all unique URLs visited. The crawler should only follow links that share the same hostname as the starting URL.
Important: Two URLs have the same hostname if and only if the hostname portion (the part between :// and the first / or end of string) is exactly identical. For example, http://example.com and http://example.com/page have the same hostname, but http://example.com and http://sub.example.com do not.
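As a quick sanity check of this rule, Python's standard `urllib.parse` module can extract the hostname portion (`netloc`); note the solution below uses manual string parsing instead, since only plain `http://hostname/path` URLs are involved:

```python
from urllib.parse import urlsplit

# For plain http://hostname/path URLs, netloc is exactly the hostname
assert urlsplit("http://example.com").netloc == urlsplit("http://example.com/page").netloc
assert urlsplit("http://example.com").netloc != urlsplit("http://sub.example.com").netloc
```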
Constraints:
- Use `htmlParser.getUrls(url)` to retrieve all links on a given page.
- `startUrl` and all returned URLs are of the form `http://hostname/path` or `http://hostname`.
- `htmlParser.getUrls()` will not return duplicate URLs in a single call.

Example 1:
```
Input: startUrl = "http://news.site.com/articles"
htmlParser.getUrls("http://news.site.com/articles") = [
  "http://news.site.com/sports",
  "http://news.site.com/tech",
  "http://weather.site.com"
]
htmlParser.getUrls("http://news.site.com/sports") = ["http://news.site.com/articles"]
htmlParser.getUrls("http://news.site.com/tech") = []

Output: ["http://news.site.com/articles", "http://news.site.com/sports", "http://news.site.com/tech"]
```
Explanation: "http://weather.site.com" has a different hostname (weather.site.com vs news.site.com), so it's not crawled.
Example 2:
```
Input: startUrl = "http://store.com"
htmlParser.getUrls("http://store.com") = [
  "http://store.com/products",
  "http://blog.store.com"
]
htmlParser.getUrls("http://store.com/products") = []

Output: ["http://store.com", "http://store.com/products"]
```
Explanation: "http://blog.store.com" has a different hostname (blog.store.com vs store.com), so it's not crawled.
Hint 1: Graph Traversal This problem is essentially a graph traversal where each URL is a node and links are edges. Consider using BFS (Breadth-First Search) or DFS (Depth-First Search) to explore all reachable nodes while keeping track of visited URLs.
Hint 2: Hostname Extraction You need a helper function to extract the hostname from a URL. The hostname is the part between `://` and the next `/` (or end of string if no slash follows). Use string manipulation to extract and compare hostnames.
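A minimal sketch of such a helper (the name `get_hostname` is just a suggestion; it assumes every URL contains `://`, as the constraints guarantee):

```python
def get_hostname(url: str) -> str:
    # Text after "://" up to the next "/" (or end of string)
    rest = url.split('://', 1)[1]
    return rest.split('/', 1)[0]

# get_hostname("http://news.site.com/articles") → "news.site.com"
```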
Hint 3: Avoid Revisiting Use a set data structure to track which URLs have already been visited. Before adding a URL to your queue/stack for processing, check if it's already in the visited set to prevent infinite loops and duplicate work.
Full Solution

Solution Explanation:
The solution uses a Breadth-First Search (BFS) approach to traverse the web graph:
1. Hostname Extraction: We create a helper function `get_hostname()` that extracts the hostname portion from a URL by finding the text between `://` and the next `/` (or end of string).
2. BFS Traversal: We use a queue to process URLs level by level. Starting with the initial URL, we:
   - Dequeue the next URL to process
   - Fetch all linked URLs using the parser
   - Filter URLs to only those with matching hostnames
   - Add new, unvisited URLs to both the visited set and the queue
3. Cycle Prevention: The `visited` set ensures each URL is processed only once, preventing infinite loops from circular references.
4. Hostname Filtering: Before adding a URL to the queue, we verify it shares the same hostname as the starting URL.
Complexity Analysis:
- Time Complexity: O(N) where N is the total number of unique URLs. Each URL is visited exactly once, and for each visit, we process its links.
- Space Complexity: O(N) for storing the visited set and the queue in the worst case where all URLs are on the same domain.
Alternative Approach: You could also use DFS with a stack instead of BFS with a queue. The traversal order would differ, but both would visit the same set of URLs.
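The DFS variant can be sketched as follows (a self-contained sketch, not the editorial solution; `crawl_dfs` and the inlined `get_hostname` helper mirror the BFS version but swap the queue for a stack):

```python
def crawl_dfs(startUrl: str, htmlParser) -> list:
    def get_hostname(url: str) -> str:
        # Text after "://" up to the next "/" (or end of string)
        rest = url.split('://', 1)[1]
        return rest.split('/', 1)[0]

    target = get_hostname(startUrl)
    visited = {startUrl}
    stack = [startUrl]
    while stack:
        current = stack.pop()  # LIFO pop gives depth-first order
        for url in htmlParser.getUrls(current):
            if url not in visited and get_hostname(url) == target:
                visited.add(url)
                stack.append(url)
    return list(visited)
```

The only change from BFS is `stack.pop()` in place of `queue.popleft()`; the set of visited URLs is identical, only the order of discovery differs.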
```python
from typing import List
from collections import deque


class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        pass


def crawl(startUrl: str, htmlParser: HtmlParser) -> List[str]:
    """
    Crawl all URLs starting from startUrl that share the same hostname.

    Approach: BFS traversal with hostname filtering
    Time Complexity: O(N) where N is the number of URLs
    Space Complexity: O(N) for the visited set and queue
    """
    def get_hostname(url: str) -> str:
        """Extract hostname from URL."""
        # Find the start of hostname (after ://)
        protocol_end = url.find('://') + 3
        # Find the end of hostname (next / or end of string)
        path_start = url.find('/', protocol_end)
        if path_start == -1:
            # No path, hostname goes to end of string
            return url[protocol_end:]
        return url[protocol_end:path_start]

    # Get the target hostname to match against
    target_hostname = get_hostname(startUrl)
    # Track visited URLs to avoid duplicates and cycles
    visited = {startUrl}
    # BFS queue
    queue = deque([startUrl])

    while queue:
        current_url = queue.popleft()
        # Get all URLs linked from the current page
        for url in htmlParser.getUrls(current_url):
            # Skip already-visited URLs; only crawl matching hostnames
            if url not in visited and get_hostname(url) == target_hostname:
                visited.add(url)
                queue.append(url)

    # Return all visited URLs as a list
    return list(visited)
```