Given a stream of strings that may contain duplicates, design an algorithm to read strings one by one and remove duplicates, retaining only the first occurrence of each string.
The problem is divided into two parts:
Part 1: Exact deduplication using hash sets
Part 2: Near-duplicate detection using edit distance and normalization
Input: A stream of strings separated by commas
Output: A de-duplicated stream of strings separated by commas
Strings are case-sensitive in Part 1
Must process strings one at a time (streaming)
Part 2 introduces fuzzy matching with configurable thresholds
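Part 2's exact interface and thresholds are not specified here, so the following is only an illustrative sketch: it normalizes each string (trim and lowercase) and treats a new string as a duplicate if any previously kept string is within a configurable Levenshtein edit-distance threshold. The function names `edit_distance` and `dedupe_fuzzy` and the `threshold` parameter are assumptions, not part of the problem statement:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dedupe_fuzzy(strings, threshold=1):
    """Keep a string only if no previously kept string is within
    `threshold` edits of its normalized (trimmed, lowercased) form."""
    kept, kept_norm = [], []
    for s in strings:
        norm = s.strip().lower()  # normalization step
        if all(edit_distance(norm, k) > threshold for k in kept_norm):
            kept.append(s)        # emit the original, un-normalized string
            kept_norm.append(norm)
    return kept
```

With `threshold=1`, `dedupe_fuzzy(["apple", "Apple ", "aple", "orange"])` keeps only `["apple", "orange"]`: `"Apple "` normalizes to an exact match and `"aple"` is one edit away. Note this brute-force comparison against all kept strings is O(n) per input and would need indexing (e.g. BK-trees) to scale.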
Remove exact duplicate strings from the stream, keeping only the first occurrence.
Implement a function that:
Processes strings one at a time from an iterator
Maintains a set of seen strings
Yields only strings that have not been seen before
Preserves the order of first occurrences
input_stream = "apple,banana,apple,orange,banana"
output = deduplicate_csv_stream(input_stream)
print(output)  # "apple,banana,orange"
Maintain a hash set of seen strings and filter out duplicates as we process the stream.
Create an empty hash set seen
For each incoming string:
Check if the string exists in seen
If not seen, add to seen and output the string
If already seen, skip it
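The steps above can be sketched directly in Python. The function name and comma-separated interface follow the earlier example; splitting the input string stands in for reading from a true stream:

```python
def deduplicate_csv_stream(input_stream):
    """Remove exact duplicates from a comma-separated stream,
    keeping the first occurrence of each string (case-sensitive)."""
    seen = set()    # hash set of strings already emitted
    result = []
    for s in input_stream.split(","):  # process one string at a time
        if s not in seen:              # O(1) average-case membership test
            seen.add(s)
            result.append(s)
    return ",".join(result)
```

Each string is checked and inserted in O(1) average time, so the whole pass is O(n) with O(k) extra space, where k is the number of distinct strings. For a genuinely unbounded stream, the same loop works as a generator over an iterator of strings, yielding each first occurrence as it arrives.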