Level: Unknown
Round: Full Journey · Type: Coding · Difficulty: 7/10 · Duration: 240 min · Interviewer: Neutral
Topics: Algorithms, Data Structures, String Manipulation, Streaming Algorithms, Optimization, Multithreading
Location: San Francisco Bay Area
Interview date: 2026-01-03
Question: Implement a function that acts like a simple grep tool to search for lines containing a specific target string, printing the line and a specified number of lines before and after it. Duplicates should be avoided, and the original order of lines should be maintained.
Question: Adjust the solution for a streaming input where lines come in one at a time, without the ability to see the entire list at the start. Use O(k) memory.
Question: Optimize the solution, especially when the number of lines around a match (k) is very large, to improve performance.
Question: Design a solution for handling massive files using multiple CPU threads in a map-reduce style.
The problem required me to implement a grep tool with context lines. The interview progressed through several stages:
Part 1: Static Input
I was given a list of strings and needed to find lines containing a specific target string, printing those lines along with a specified number of lines before and after each match. I needed to avoid duplicates and preserve the original order of lines.
My approach was to use a boolean list to mark lines to keep. I looped through each line, and if it contained the target, I marked that line and its neighbors as True. Finally, I returned the marked lines. The code is:
```python
def grep_with_context(
    lines: list[str],
    search_target: str,
    lines_around: int,
) -> list[str]:
    if lines_around < 0:
        raise ValueError("lines_around must be >= 0")
    n = len(lines)
    marked = [False] * n
    k = lines_around
    # Scan the lines
    for i, line in enumerate(lines):
        if search_target in line:
            # Determine the start and end of the context window
            left = max(0, i - k)
            right = min(n - 1, i + k)
            # Mark all lines in this window
            for j in range(left, right + 1):
                marked[j] = True
    # Collect marked lines in original order
    return [line for i, line in enumerate(lines) if marked[i]]
```
The time complexity is O(n * k) in the worst case (e.g., when every line matches), and the space complexity is O(n).
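A small standalone demo of the approach (the function is repeated inline in compact form so the snippet runs on its own; the sample log lines are my own illustration):

```python
def grep_with_context(lines, search_target, lines_around):
    # Compact restatement of the marking approach so this demo is standalone.
    n = len(lines)
    marked = [False] * n
    for i, line in enumerate(lines):
        if search_target in line:
            for j in range(max(0, i - lines_around), min(n - 1, i + lines_around) + 1):
                marked[j] = True
    return [line for i, line in enumerate(lines) if marked[i]]

logs = ["boot ok", "ERROR disk", "retrying", "ERROR net", "recovered", "shutdown"]
# The two windows [0, 2] and [2, 4] overlap at index 2, but "retrying" is printed once.
print(grep_with_context(logs, "ERROR", 1))
# → ['boot ok', 'ERROR disk', 'retrying', 'ERROR net', 'recovered']
```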
Part 2: Streaming Input
Next, the requirement changed to processing lines one at a time. I couldn't see the whole list at the start.
I used a deque to store the last k lines to handle the "before" context and a variable, emit_until, to track how far into the future to print lines. I also used a flag on each buffered line to avoid duplicates. Here's the code:
```python
from collections import deque

class StreamingGrep:
    def __init__(self, search_target: str, lines_around: int):
        if lines_around < 0:
            raise ValueError("lines_around must be >= 0")
        self.search_target = search_target
        self.k = lines_around
        self.idx = -1
        self.emit_until = -1
        # Each entry stores: [index, line, is_printed]
        self.buffer = deque()

    def process_line(self, line: str) -> list[str]:
        self.idx += 1
        out: list[str] = []
        # Add the new line to the buffer
        self.buffer.append([self.idx, line, False])
        # Drop lines that are too old to serve as "before" context,
        # which keeps the buffer at O(k) entries
        min_keep_idx = self.idx - self.k
        while self.buffer and self.buffer[0][0] < min_keep_idx:
            self.buffer.popleft()
        if self.search_target in line:
            # Extend how far into the future we need to keep printing
            self.emit_until = max(self.emit_until, self.idx + self.k)
            # Emit everything currently buffered: the "before" context and the match
            for entry in self.buffer:
                if not entry[2]:  # not printed yet
                    out.append(entry[1])
                    entry[2] = True
        # Emit buffered lines that fall within the "after" context of an earlier match
        for entry in self.buffer:
            if entry[0] <= self.emit_until and not entry[2]:
                out.append(entry[1])
                entry[2] = True
        return out
```
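The same one-pass, O(k)-memory idea can also be written more compactly with a `deque(maxlen=k)` for the before-context and a countdown for the after-context. This is a sketch of an equivalent formulation (my own illustration, not the code I wrote in the interview):

```python
from collections import deque

def stream_grep(lines_iter, target, k):
    """Yield matching lines with k lines of context: one pass, O(k) memory."""
    before = deque(maxlen=k)   # the last k lines that have NOT been printed
    after_remaining = 0        # how many more lines to print after the last match
    for line in lines_iter:
        if target in line:
            # Flush the buffered before-context, then the match itself.
            while before:
                yield before.popleft()
            yield line
            after_remaining = k
        elif after_remaining > 0:
            # Still inside the "after" window of the previous match.
            yield line
            after_remaining -= 1
        else:
            before.append(line)

print(list(stream_grep(iter(["a", "b hit", "c", "d", "e"]), "hit", 1)))
# → ['a', 'b hit', 'c']
```

Because `before` only ever holds unprinted lines, overlapping windows cannot print a line twice.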
Part 3: Optimization
I needed to optimize the solution for large values of k. Instead of marking each line individually, I used intervals. When a match was found, I calculated the range [start, end]. If this range overlapped with the previous range, I merged them. Otherwise, I saved the previous range and started a new one. This makes the scan O(n) regardless of k, since each match does O(1) work instead of marking up to 2k + 1 lines. The code is:
```python
def grep_with_context_optimized(
    lines: list[str],
    search_target: str,
    lines_around: int,
) -> list[str]:
    if lines_around < 0:
        raise ValueError("lines_around must be >= 0")
    n = len(lines)
    k = lines_around
    intervals: list[list[int]] = []
    for i, line in enumerate(lines):
        if search_target in line:
            left = max(0, i - k)
            right = min(n - 1, i + k)
            # If the list is empty or the new window does not overlap
            # (or touch) the previous interval, start a new one
            if not intervals or left > intervals[-1][1] + 1:
                intervals.append([left, right])
            else:
                # Merge with the previous interval
                intervals[-1][1] = max(intervals[-1][1], right)
    result: list[str] = []
    # Collect lines based on the merged intervals
    for left, right in intervals:
        for i in range(left, right + 1):
            result.append(lines[i])
    return result
```
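The interval-merging step in isolation, on a small example (a standalone sketch; the match indices are illustrative). This same merge also serves as the reduce step in the multithreaded design of Part 4:

```python
def merge_windows(match_indices, k, n):
    """Merge the [i-k, i+k] context windows of sorted match indices into disjoint intervals."""
    intervals = []
    for i in match_indices:
        left, right = max(0, i - k), min(n - 1, i + k)
        if not intervals or left > intervals[-1][1] + 1:
            intervals.append([left, right])
        else:
            intervals[-1][1] = max(intervals[-1][1], right)
    return intervals

# Matches at 2 and 4 produce overlapping windows that collapse into one interval;
# the match at 20 is far enough away to start a second interval.
print(merge_windows([2, 4, 20], k=3, n=30))  # → [[0, 7], [17, 23]]
```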
Part 4: Multithreading
Finally, I discussed how to solve the problem for a massive file using multiple CPU threads. I suggested a map-reduce approach: workers scan disjoint chunks of lines and return match positions expressed in global line indices, and a coordinator merges the resulting intervals, exactly as in Part 3, and prints the lines. Because the merge happens over global indices, matches near chunk boundaries are handled correctly without workers needing to see neighboring chunks.
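A minimal sketch of that design (my own illustration: the chunking scheme and the names `parallel_grep` and `find_match_indices` are assumptions, and in CPython the GIL means pure-Python scanning would need a process pool or a C-level scanner for real CPU parallelism; the thread pool here only shows the map-reduce structure):

```python
from concurrent.futures import ThreadPoolExecutor

def find_match_indices(chunk_start, chunk_lines, target):
    """Map step: a worker returns the global indices of matches in its chunk."""
    return [chunk_start + i for i, line in enumerate(chunk_lines) if target in line]

def parallel_grep(lines, target, k, workers=4):
    n = len(lines)
    chunk = max(1, (n + workers - 1) // workers)
    tasks = [(start, lines[start:start + chunk]) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: find_match_indices(t[0], t[1], target), tasks))
    # Reduce step: merge each match's [i-k, i+k] window into disjoint intervals,
    # as in Part 3; global indices make chunk boundaries a non-issue.
    intervals = []
    for i in sorted(i for chunk_hits in results for i in chunk_hits):
        left, right = max(0, i - k), min(n - 1, i + k)
        if not intervals or left > intervals[-1][1] + 1:
            intervals.append([left, right])
        else:
            intervals[-1][1] = max(intervals[-1][1], right)
    return [lines[i] for left, right in intervals for i in range(left, right + 1)]
```

For a genuinely massive file, the workers would read byte-offset chunks and align them to newline boundaries rather than receiving pre-split lists of lines.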