Given a stream of strings that may contain duplicates, design an algorithm to read strings one by one and remove duplicates, retaining only the first occurrence of each string.
The problem is divided into two parts:
Part 1: Exact deduplication using hash sets
Part 2: Near-duplicate detection using edit distance and normalization
Input: A stream of strings separated by commas
Output: A de-duplicated stream of strings separated by commas
Strings are case-sensitive in Part 1
Must process strings one at a time (streaming)
Part 2 introduces fuzzy matching with configurable thresholds
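Part 2's exact interface and thresholds are not specified here, so the following is only an illustrative sketch: it normalizes each string (trim and lowercase) and treats a new string as a duplicate if any previously kept string is within a configurable Levenshtein edit-distance threshold. The function names `edit_distance` and `dedupe_fuzzy` and the `threshold` parameter are assumptions, not part of the problem statement:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dedupe_fuzzy(strings, threshold=1):
    """Keep a string only if no previously kept string is within
    `threshold` edits of its normalized (trimmed, lowercased) form."""
    kept, kept_norm = [], []
    for s in strings:
        norm = s.strip().lower()  # normalization step
        if all(edit_distance(norm, k) > threshold for k in kept_norm):
            kept.append(s)        # emit the original, un-normalized string
            kept_norm.append(norm)
    return kept
```

With `threshold=1`, `dedupe_fuzzy(["apple", "Apple ", "aple", "orange"])` keeps only `["apple", "orange"]`: `"Apple "` normalizes to an exact match and `"aple"` is one edit away. Note this brute-force comparison against all kept strings is O(n) per input and would need indexing (e.g. BK-trees) to scale.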
Remove exact duplicate strings from the stream, keeping only the first occurrence.
Implement a function that:
Processes strings one at a time from an iterator
Maintains a set of seen strings
Yields only strings that have not been seen before
Preserves the order of first occurrences
input_stream = "apple,banana,apple,orange,banana"
output = deduplicate_csv_stream(input_stream)
print(output)  # "apple,banana,orange"
Maintain a hash set of seen strings and filter out duplicates as we process the stream.
Create an empty hash set seen
For each incoming string:
Check if the string exists in seen
If not seen, add to seen and output the string
If already seen, skip it
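The steps above can be sketched directly in Python. The function name and comma-separated interface follow the earlier example; splitting the input string stands in for reading from a true stream:

```python
def deduplicate_csv_stream(input_stream):
    """Remove exact duplicates from a comma-separated stream,
    keeping the first occurrence of each string (case-sensitive)."""
    seen = set()    # hash set of strings already emitted
    result = []
    for s in input_stream.split(","):  # process one string at a time
        if s not in seen:              # O(1) average-case membership test
            seen.add(s)
            result.append(s)
    return ",".join(result)
```

Each string is checked and inserted in O(1) average time, so the whole pass is O(n) with O(k) extra space, where k is the number of distinct strings. For a genuinely unbounded stream, the same loop works as a generator over an iterator of strings, yielding each first occurrence as it arrives.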