Level: Unknown
Round: Full Journey · Type: Coding · Difficulty: 7/10 · Duration: 240 min · Interviewer: Neutral
Topics: Algorithms, Data Structures, String Manipulation, Streaming Algorithms, Optimization, Multithreading
Location: San Francisco Bay Area
Interview date: 2026-01-03
Question: Implement a function that acts like a simple grep tool to search for lines containing a specific target string, printing the line and a specified number of lines before and after it. Duplicates should be avoided, and the original order of lines should be maintained.
Question: Adjust the solution for a streaming input where lines come in one at a time, without the ability to see the entire list at the start. Use O(k) memory.
Question: Optimize the solution, especially when the number of lines around a match (k) is very large, to improve performance.
Question: Design a solution for handling massive files using multiple CPU threads in a map-reduce style.
The problem required me to implement a grep tool with context lines. The interview progressed through several stages:
Part 1: Static Input
I was given a list of strings and needed to find lines containing a specific target string, printing those lines along with a specified number of lines before and after each match. I needed to avoid duplicates and preserve the original order of lines.
My approach was to use a boolean list to mark lines to keep. I looped through each line, and if it contained the target, I marked that line and its neighbors as True. Finally, I returned the marked lines. The code is:
```python
def grep_with_context(
    lines: list[str],
    search_target: str,
    lines_around: int,
) -> list[str]:
    if lines_around < 0:
        raise ValueError("lines_around must be >= 0")
    n = len(lines)
    marked = [False] * n
    k = lines_around
    # Scan the lines
    for i, line in enumerate(lines):
        if search_target in line:
            # Determine the start and end of the context window
            left = max(0, i - k)
            right = min(n - 1, i + k)
            # Mark all lines in this window
            for j in range(left, right + 1):
                marked[j] = True
    # Collect marked lines in original order
    return [line for i, line in enumerate(lines) if marked[i]]
```
The time complexity is O(n * k) in the worst case (e.g., when every line matches), and the space complexity is O(n).
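A small standalone demo of the approach (the function is repeated inline in compact form so the snippet runs on its own; the sample log lines are my own illustration):

```python
def grep_with_context(lines, search_target, lines_around):
    # Compact restatement of the marking approach so this demo is standalone.
    n = len(lines)
    marked = [False] * n
    for i, line in enumerate(lines):
        if search_target in line:
            for j in range(max(0, i - lines_around), min(n - 1, i + lines_around) + 1):
                marked[j] = True
    return [line for i, line in enumerate(lines) if marked[i]]

logs = ["boot ok", "ERROR disk", "retrying", "ERROR net", "recovered", "shutdown"]
# The two windows [0, 2] and [2, 4] overlap at index 2, but "retrying" is printed once.
print(grep_with_context(logs, "ERROR", 1))
# → ['boot ok', 'ERROR disk', 'retrying', 'ERROR net', 'recovered']
```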
Part 2: Streaming Input
Next, the requirement changed to processing lines one at a time. I couldn't see the whole list at the start.
I used a deque to store the last k lines to handle the "before" context and a variable, emit_until, to track how far into the future to print lines. I also used a flag on each buffered line to avoid duplicates. Here's the code:
```python
from collections import deque

class StreamingGrep:
    def __init__(self, search_target: str, lines_around: int):
        if lines_around < 0:
            raise ValueError("lines_around must be >= 0")
        self.search_target = search_target
        self.k = lines_around
        self.idx = -1
        self.emit_until = -1
        # Each entry stores: [index, line, is_printed]
        self.buffer = deque()

    def process_line(self, line: str) -> list[str]:
        self.idx += 1
        out: list[str] = []
        # Add the new line to the buffer
        self.buffer.append([self.idx, line, False])
        # Drop lines that are too old to serve as "before" context,
        # which keeps the buffer at O(k) entries
        min_keep_idx = self.idx - self.k
        while self.buffer and self.buffer[0][0] < min_keep_idx:
            self.buffer.popleft()
        if self.search_target in line:
            # Extend how far into the future we need to keep printing
            self.emit_until = max(self.emit_until, self.idx + self.k)
            # Emit everything currently buffered: the "before" context and the match
            for entry in self.buffer:
                if not entry[2]:  # not printed yet
                    out.append(entry[1])
                    entry[2] = True
        # Emit buffered lines that fall within the "after" context of an earlier match
        for entry in self.buffer:
            if entry[0] <= self.emit_until and not entry[2]:
                out.append(entry[1])
                entry[2] = True
        return out
```
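The same one-pass, O(k)-memory idea can also be written more compactly with a `deque(maxlen=k)` for the before-context and a countdown for the after-context. This is a sketch of an equivalent formulation (my own illustration, not the code I wrote in the interview):

```python
from collections import deque

def stream_grep(lines_iter, target, k):
    """Yield matching lines with k lines of context: one pass, O(k) memory."""
    before = deque(maxlen=k)   # the last k lines that have NOT been printed
    after_remaining = 0        # how many more lines to print after the last match
    for line in lines_iter:
        if target in line:
            # Flush the buffered before-context, then the match itself.
            while before:
                yield before.popleft()
            yield line
            after_remaining = k
        elif after_remaining > 0:
            # Still inside the "after" window of the previous match.
            yield line
            after_remaining -= 1
        else:
            before.append(line)

print(list(stream_grep(iter(["a", "b hit", "c", "d", "e"]), "hit", 1)))
# → ['a', 'b hit', 'c']
```

Because `before` only ever holds unprinted lines, overlapping windows cannot print a line twice.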
Part 3: Optimization
I needed to optimize the solution for large values of k. Instead of marking each line individually, I used intervals. When a match was found, I calculated the range [start, end]. If this range overlapped with the previous range, I merged them. Otherwise, I saved the previous range and started a new one. This makes the scan O(n) regardless of k, since each match does O(1) work instead of marking up to 2k + 1 lines. The code is:
```python
def grep_with_context_optimized(
    lines: list[str],
    search_target: str,
    lines_around: int,
) -> list[str]:
    if lines_around < 0:
        raise ValueError("lines_around must be >= 0")
    n = len(lines)
    k = lines_around
    intervals: list[list[int]] = []
    for i, line in enumerate(lines):
        if search_target in line:
            left = max(0, i - k)
            right = min(n - 1, i + k)
            # If the list is empty or the new window does not overlap
            # (or touch) the previous interval, start a new one
            if not intervals or left > intervals[-1][1] + 1:
                intervals.append([left, right])
            else:
                # Merge with the previous interval
                intervals[-1][1] = max(intervals[-1][1], right)
    result: list[str] = []
    # Collect lines based on the merged intervals
    for left, right in intervals:
        for i in range(left, right + 1):
            result.append(lines[i])
    return result
```
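The interval-merging step in isolation, on a small example (a standalone sketch; the match indices are illustrative). This same merge also serves as the reduce step in the multithreaded design of Part 4:

```python
def merge_windows(match_indices, k, n):
    """Merge the [i-k, i+k] context windows of sorted match indices into disjoint intervals."""
    intervals = []
    for i in match_indices:
        left, right = max(0, i - k), min(n - 1, i + k)
        if not intervals or left > intervals[-1][1] + 1:
            intervals.append([left, right])
        else:
            intervals[-1][1] = max(intervals[-1][1], right)
    return intervals

# Matches at 2 and 4 produce overlapping windows that collapse into one interval;
# the match at 20 is far enough away to start a second interval.
print(merge_windows([2, 4, 20], k=3, n=30))  # → [[0, 7], [17, 23]]
```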
Part 4: Multithreading
Finally, I discussed how to solve the problem for a massive file using multiple CPU threads. I suggested a map-reduce approach: workers scan disjoint chunks of lines and return match positions expressed in global line indices, and a coordinator merges the resulting intervals, exactly as in Part 3, and prints the lines. Because the merge happens over global indices, matches near chunk boundaries are handled correctly without workers needing to see neighboring chunks.
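A minimal sketch of that design (my own illustration: the chunking scheme and the names `parallel_grep` and `find_match_indices` are assumptions, and in CPython the GIL means pure-Python scanning would need a process pool or a C-level scanner for real CPU parallelism; the thread pool here only shows the map-reduce structure):

```python
from concurrent.futures import ThreadPoolExecutor

def find_match_indices(chunk_start, chunk_lines, target):
    """Map step: a worker returns the global indices of matches in its chunk."""
    return [chunk_start + i for i, line in enumerate(chunk_lines) if target in line]

def parallel_grep(lines, target, k, workers=4):
    n = len(lines)
    chunk = max(1, (n + workers - 1) // workers)
    tasks = [(start, lines[start:start + chunk]) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: find_match_indices(t[0], t[1], target), tasks))
    # Reduce step: merge each match's [i-k, i+k] window into disjoint intervals,
    # as in Part 3; global indices make chunk boundaries a non-issue.
    intervals = []
    for i in sorted(i for chunk_hits in results for i in chunk_hits):
        left, right = max(0, i - k), min(n - 1, i + k)
        if not intervals or left > intervals[-1][1] + 1:
            intervals.append([left, right])
        else:
            intervals[-1][1] = max(intervals[-1][1], right)
    return [lines[i] for left, right in intervals for i in range(left, right + 1)]
```

For a genuinely massive file, the workers would read byte-offset chunks and align them to newline boundaries rather than receiving pre-split lists of lines.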