Web Content Parsing and Traversal Algorithms

Problem

You are building a web scraping tool that needs to extract specific HTML elements from raw HTML strings. Your task is to implement a function that finds all occurrences of a particular HTML tag within an HTML document and returns them as a list of strings.

Given an HTML string and a tag name, parse the HTML structure and locate every instance of that tag, including its complete content (opening tag, inner content, and closing tag). Return the matches in document order, that is, ordered by the position of their opening tags, which is the order a pre-order depth-first traversal of the tree produces.

For example, given the HTML <div><p>Hello</p><span>World</span></div> and the tag name "p", your function should return ['<p>Hello</p>'].
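The first step, recognizing tags in the raw string, can be sketched as a small tokenizer. This is only an illustration under the constraints stated in this problem (no attributes, lowercase alphabetic names, well-formed input); the names TOKEN_RE and tokenize are hypothetical and not part of the problem statement.

```python
import re
from typing import Iterator, Tuple

# Assumption: tags carry no attributes and names are lowercase letters only,
# so a single simple pattern covers both opening and closing tags.
TOKEN_RE = re.compile(r"<(/?)([a-z]+)>")


def tokenize(html: str) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (kind, name, start, end) for each opening or closing tag.

    kind is "open" or "close"; start and end are offsets into html.
    Text between tags is skipped because only tag positions matter here.
    """
    for match in TOKEN_RE.finditer(html):
        kind = "close" if match.group(1) else "open"
        yield kind, match.group(2), match.start(), match.end()


if __name__ == "__main__":
    for token in tokenize("<div><p>Hello</p><span>World</span></div>"):
        print(token)
```

Keeping the offsets alongside each token means a complete tag string can later be recovered by slicing the original input rather than by reassembling it piece by piece.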

Requirements

Parse the HTML string and build an internal representation (tree structure)
Search for all occurrences of the specified tag name
Return complete tag strings including opening tag, content, and closing tag
Return results in document order (pre-order depth-first traversal, so a matching parent precedes its matching descendants)
Handle nested structures correctly
Return an empty list if no matching tags are found
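One way to meet these requirements is sketched below: build a tree of nodes that record where each tag starts and ends, then walk it in pre-order. This is a minimal sketch, not a reference solution. Node, parse, and find_tags are hypothetical names, and the code assumes well-formed input with no attributes and no raw "<" inside text. The traversal is iterative because nesting can be deep enough to exceed Python's default recursion limit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    start: int                 # offset of '<' in the opening tag
    end: int = -1              # offset just past '>' of the closing tag
    children: List["Node"] = field(default_factory=list)


def parse(html: str) -> Node:
    """Build a tree of tag nodes under a synthetic root (assumes well-formed input)."""
    root = Node(name="", start=0, end=len(html))
    stack = [root]
    i = 0
    while i < len(html):
        if html[i] != "<":
            i += 1                         # skip text content
            continue
        close = html.index(">", i)
        if html[i + 1] == "/":
            stack.pop().end = close + 1    # closing tag completes the innermost open node
        else:
            node = Node(name=html[i + 1:close], start=i)
            stack[-1].children.append(node)
            stack.append(node)
        i = close + 1
    return root


def find_tags(html: str, tag_name: str) -> List[str]:
    """Return every complete tag_name element in document (pre-order) order."""
    root = parse(html)
    result: List[str] = []
    pending = list(reversed(root.children))    # explicit stack instead of recursion
    while pending:
        node = pending.pop()
        if node.name == tag_name:
            result.append(html[node.start:node.end])
        pending.extend(reversed(node.children))  # reversed so the leftmost child pops first
    return result
```

Storing offsets instead of copied strings keeps each node small; the complete tag text is sliced from the original input only for nodes that match.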

Constraints

The HTML string length will be between 1 and 10,000 characters
Tag names will be lowercase alphabetic characters only
HTML is well-formed (properly nested and closed tags)
Tags do not have attributes for this problem
Time complexity should be O(n) for parsing and traversal, where n is the length of the HTML string (not counting the cost of copying matched substrings into the result)
Space complexity should be O(n) for the tree structure
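As a point of comparison for these bounds, the same results can be produced in a single left-to-right scan with a stack of open tags and no explicit tree. The sketch below is an alternative to the tree-based approach the requirements describe, not a replacement for it; find_tags_single_pass is a hypothetical name, and well-formed, attribute-free input is assumed. Because an inner element closes before its parent, a slot in the result list is reserved when a matching tag opens, which preserves document order.

```python
from typing import List, Optional, Tuple


def find_tags_single_pass(html: str, tag_name: str) -> List[str]:
    """Collect complete tag_name elements in document order using one scan."""
    result: List[Optional[str]] = []
    open_tags: List[Tuple[int, int]] = []   # (start offset, result slot or -1)
    i = 0
    while True:
        i = html.find("<", i)
        if i == -1:
            break
        close = html.index(">", i)
        if html[i + 1] == "/":
            start, slot = open_tags.pop()
            if slot >= 0:
                # Fill the slot reserved when this element was opened.
                result[slot] = html[start:close + 1]
        else:
            if html[i + 1:close] == tag_name:
                result.append(None)          # reserve a slot at opening time
                open_tags.append((i, len(result) - 1))
            else:
                open_tags.append((i, -1))
        i = close + 1
    # With well-formed input every reserved slot has been filled by now.
    return [s for s in result if s is not None]
```

The scan itself visits each character a constant number of times. The slices are a separate cost: when matching tags are nested inside one another, each match copies its descendants again, so the total size of the output can grow faster than n.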

Examples

Example 1:

Input: html = "<div><p>Hello</p><span>World</span></div>", tag_name = "p"
Output: ['<p>Hello</p>']
Explanation: There is one paragraph tag containing "Hello".

Example 2:

Input: html = "<div><p>First</p><p>Second</p></div>", tag_name = "p"
Output: ['<p>First</p>', '<p>Second</p>']
Explanation: Both paragraph tags are found in the order they appear.

Example 3:

Input: html = "<div><div><span>Nested</span></div></div>", tag_name = "div"
Output: ['<div><div><span>Nested</span></div></div>', '<div><span>Nested</span></div>']
Explanation: The outer div contains the inner div. Both are returned with their complete content, and the outer div comes first because its opening tag appears first.