You are given a startUrl and an interface HtmlParser that can fetch all URLs from a given web page.

Implement a web crawler that returns all URLs reachable from startUrl that share the same hostname as startUrl. The order of URLs in the result does not matter.

Below is the interface for HtmlParser: [Source: darkinterview.com]

    interface HtmlParser {
        // Returns all URLs from a given page URL.
        public List<String> getUrls(String url);
    }

Your function will be called like List<String> crawl(String startUrl, HtmlParser htmlParser). Your crawler should:

- Start from the page startUrl.
- Use HtmlParser.getUrls(url) to obtain all links from a page.
- Crawl only links that are under the same hostname as startUrl.
- Not crawl the same link twice.

Assume that all URLs use the http protocol and do not include a port. URLs may contain fragments (e.g. http://example.com/page#section1). Should these be treated as the same URL or different? Clarify with the interviewer if needed.

After implementing the basic single-threaded version, implement a multithreaded or concurrent version of the web crawler to improve performance.
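A minimal single-threaded sketch is a plain BFS over pages, assuming the HtmlParser interface above, that all URLs start with "http://" and carry no port, and that URLs differing only in fragment are treated as distinct. The class name Crawler and the hostname helper are illustrative, not part of the problem statement:

```java
import java.util.*;

// Interface as given in the problem statement.
interface HtmlParser {
    List<String> getUrls(String url);
}

class Crawler {
    // Extracts the hostname from a URL of the form http://hostname/path.
    // Relies on the stated guarantees: http protocol, no port.
    private static String hostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = hostname(startUrl);
        Set<String> seen = new HashSet<>();      // URLs already discovered
        Deque<String> queue = new ArrayDeque<>(); // BFS frontier
        seen.add(startUrl);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : htmlParser.getUrls(url)) {
                // Visit each same-host URL exactly once.
                if (hostname(next).equals(host) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

The seen set doubles as the result, which guarantees "do not crawl the same link twice" by construction; a DFS with the same set works equally well.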