You are building a web crawler that navigates through web pages starting from a given URL. The crawler should only visit pages that belong to the same hostname as the starting URL.
You are given:
- `startUrl` — the URL to start crawling from
- an `HtmlParser` interface that provides a method `getUrls(url)`, which returns a list of all URLs found on the page at the given URL

Your task is to implement a function that crawls all pages starting from `startUrl` and returns a list of all unique URLs visited. The crawler should only follow links that share the same hostname as the starting URL.
Important: Two URLs have the same hostname if and only if the hostname portion (the part between :// and the first / or end of string) is exactly identical. For example, http://example.com and http://example.com/page have the same hostname, but http://example.com and http://sub.example.com do not.
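As a quick sanity check of this rule, Python's standard `urllib.parse` module can extract the hostname portion (`netloc`); note the solution below uses manual string parsing instead, since only plain `http://hostname/path` URLs are involved:

```python
from urllib.parse import urlsplit

# For plain http://hostname/path URLs, netloc is exactly the hostname
assert urlsplit("http://example.com").netloc == urlsplit("http://example.com/page").netloc
assert urlsplit("http://example.com").netloc != urlsplit("http://sub.example.com").netloc
```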
Constraints:
- Use `htmlParser.getUrls(url)` to retrieve all links on a given page.
- `startUrl` and all returned URLs are of the form `http://hostname/path` or `http://hostname`.
- `htmlParser.getUrls()` will not return duplicate URLs in a single call.

Example 1:
```
Input: startUrl = "http://news.site.com/articles"
htmlParser.getUrls("http://news.site.com/articles") = [
  "http://news.site.com/sports",
  "http://news.site.com/tech",
  "http://weather.site.com"
]
htmlParser.getUrls("http://news.site.com/sports") = ["http://news.site.com/articles"]
htmlParser.getUrls("http://news.site.com/tech") = []

Output: ["http://news.site.com/articles", "http://news.site.com/sports", "http://news.site.com/tech"]
```
Explanation: "http://weather.site.com" has a different hostname (weather.site.com vs news.site.com), so it's not crawled.
Example 2:
```
Input: startUrl = "http://store.com"
htmlParser.getUrls("http://store.com") = [
  "http://store.com/products",
  "http://blog.store.com"
]
htmlParser.getUrls("http://store.com/products") = []

Output: ["http://store.com", "http://store.com/products"]
```
Explanation: "http://blog.store.com" has a different hostname (blog.store.com vs store.com), so it's not crawled.
Hint 1: Graph Traversal This problem is essentially a graph traversal where each URL is a node and links are edges. Consider using BFS (Breadth-First Search) or DFS (Depth-First Search) to explore all reachable nodes while keeping track of visited URLs.
Hint 2: Hostname Extraction You need a helper function to extract the hostname from a URL. The hostname is the part between `://` and the next `/` (or end of string if no slash follows). Use string manipulation to extract and compare hostnames.
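A minimal sketch of such a helper (the name `get_hostname` is just a suggestion; it assumes every URL contains `://`, as the constraints guarantee):

```python
def get_hostname(url: str) -> str:
    # Text after "://" up to the next "/" (or end of string)
    rest = url.split('://', 1)[1]
    return rest.split('/', 1)[0]

# get_hostname("http://news.site.com/articles") → "news.site.com"
```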
Hint 3: Avoid Revisiting Use a set data structure to track which URLs have already been visited. Before adding a URL to your queue/stack for processing, check if it's already in the visited set to prevent infinite loops and duplicate work.
Full Solution

Solution Explanation:
The solution uses a Breadth-First Search (BFS) approach to traverse the web graph:
1. Hostname Extraction: We create a helper function `get_hostname()` that extracts the hostname portion from a URL by finding the text between `://` and the next `/` (or end of string).
2. BFS Traversal: We use a queue to process URLs level by level. Starting with the initial URL, we:
   - Dequeue the next URL to process
   - Fetch all linked URLs using the parser
   - Filter URLs to only those with matching hostnames
   - Add new, unvisited URLs to both the visited set and the queue
3. Cycle Prevention: The `visited` set ensures each URL is processed only once, preventing infinite loops from circular references.
4. Hostname Filtering: Before adding a URL to the queue, we verify it shares the same hostname as the starting URL.
Complexity Analysis:
- Time Complexity: O(N) where N is the total number of unique URLs. Each URL is visited exactly once, and for each visit, we process its links.
- Space Complexity: O(N) for storing the visited set and the queue in the worst case where all URLs are on the same domain.
Alternative Approach: You could also use DFS with a stack instead of BFS with a queue. The traversal order would differ, but both would visit the same set of URLs.
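The DFS variant can be sketched as follows (a self-contained sketch, not the editorial solution; `crawl_dfs` and the inlined `get_hostname` helper mirror the BFS version but swap the queue for a stack):

```python
def crawl_dfs(startUrl: str, htmlParser) -> list:
    def get_hostname(url: str) -> str:
        # Text after "://" up to the next "/" (or end of string)
        rest = url.split('://', 1)[1]
        return rest.split('/', 1)[0]

    target = get_hostname(startUrl)
    visited = {startUrl}
    stack = [startUrl]
    while stack:
        current = stack.pop()  # LIFO pop gives depth-first order
        for url in htmlParser.getUrls(current):
            if url not in visited and get_hostname(url) == target:
                visited.add(url)
                stack.append(url)
    return list(visited)
```

The only change from BFS is `stack.pop()` in place of `queue.popleft()`; the set of visited URLs is identical, only the order of discovery differs.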
```python
from typing import List
from collections import deque


class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        pass


def crawl(startUrl: str, htmlParser: HtmlParser) -> List[str]:
    """
    Crawl all URLs starting from startUrl that share the same hostname.

    Approach: BFS traversal with hostname filtering
    Time Complexity: O(N) where N is the number of URLs
    Space Complexity: O(N) for the visited set and queue
    """
    def get_hostname(url: str) -> str:
        """Extract hostname from URL."""
        # Find the start of hostname (after ://)
        protocol_end = url.find('://') + 3
        # Find the end of hostname (next / or end of string)
        path_start = url.find('/', protocol_end)
        if path_start == -1:
            # No path, hostname goes to end of string
            return url[protocol_end:]
        return url[protocol_end:path_start]

    # Get the target hostname to match against
    target_hostname = get_hostname(startUrl)
    # Track visited URLs to avoid duplicates and cycles
    visited = {startUrl}
    # BFS queue
    queue = deque([startUrl])

    while queue:
        current_url = queue.popleft()
        # Get all URLs linked from the current page
        for url in htmlParser.getUrls(current_url):
            # Skip already-visited URLs; only crawl matching hostnames
            if url not in visited and get_hostname(url) == target_hostname:
                visited.add(url)
                queue.append(url)

    # Return all visited URLs as a list
    return list(visited)
```