Given an infinite character stream (or a very large text stream that cannot fit into memory) and a list of stop words (sensitive words), return the substring that appears before the first occurrence of any stop word.
This is a classic streaming processing problem that tests your understanding of memory-efficient algorithms and boundary condition handling.
Memory efficiency: The input is extremely large and cannot be loaded into memory all at once; it must be read in chunks.
Python generators: Use the `yield` keyword to implement streaming processing.
Cross-chunk handling: A stop word may be split across two consecutive chunks, and it must still be correctly identified.
Focus on the buffer management strategy to handle cross-chunk stop words
Use generators for memory efficiency - process chunks lazily without loading the entire stream
Handle edge cases like empty streams, stop words at boundaries, and overlapping stop words
Write your own test cases and ensure your code compiles and runs correctly
The most critical difficulty in this problem is handling stop words that are split across chunk boundaries.
For example:
```
chunks = ["This is a te", "st<st", "op> message"]
stop_words = ["<stop>", "<end>"]
```
```python
def create_stream():
    chunks = ["Hello world", " <stop> more", " text"]
    for chunk in chunks:
        yield chunk

stop_words = ["<stop>", "<end>"]
result = extract_text_before_stopword(create_stream(), stop_words)
print(result)  # Output: "Hello world "
```
Implement a generator function that yields characters until a stop word is found
Maintain a buffer to handle cross-chunk stop word detection
Support multiple stop words and return text before the first occurrence
Return all content if no stop words are found
Should the function be case-sensitive or case-insensitive?
What should happen if multiple stop words overlap in the stream?
How should we handle very small chunks (single characters)?
Can stop words be empty strings or contain special characters?
The key insight is to maintain a buffer that stores unconsumed characters from the previous chunk.
Suppose the longest stop word is "<stop>" (6 characters). To detect it across chunk boundaries:
Keep the last 5 characters from chunk n
Prepend them to chunk n+1
This guarantees any 6-character pattern spanning the boundary will be visible
```
chunks = ["This is a te", "st<st", "op> message"]
```
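That guarantee can be checked directly. The snippet below is a two-chunk variant of the example above (variable names are illustrative): neither chunk alone contains `"<stop>"`, but the retained 5-character tail of the first chunk plus the second chunk does.

```python
chunk_n = "This is a test<st"    # ends mid-stop-word
chunk_n1 = "op> message"

tail = chunk_n[-5:]              # keep max_len - 1 = 5 characters: "st<st"
combined = tail + chunk_n1       # "st<stop> message"

assert "<stop>" not in chunk_n and "<stop>" not in chunk_n1
assert "<stop>" in combined      # the split pattern is now visible
```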
Calculate max_stop_len from all stop words
Maintain a buffer of at most max_stop_len - 1 characters
For each chunk, search buffer + chunk for stop words
Yield the "safe" prefix and keep the remaining characters in the buffer
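Putting those steps together, here is one possible sketch: a generator that yields characters lazily, plus an eager wrapper matching the `extract_text_before_stopword` call shown in the usage example. All names other than `extract_text_before_stopword` are my own, and I assume `stream` is an iterable of string chunks and every stop word is a non-empty string.

```python
def stream_until_stopword(stream, stop_words):
    """Lazily yield characters from a chunked stream until any stop word appears."""
    max_len = max(len(w) for w in stop_words)
    keep = max_len - 1            # tail length that could begin a split stop word
    buffer = ""
    for chunk in stream:
        buffer += chunk
        # Earliest occurrence of any stop word in the combined buffer, if any.
        hits = [buffer.find(w) for w in stop_words if w in buffer]
        if hits:
            yield from buffer[:min(hits)]
            return
        # Flush the "safe" prefix; retain only the tail that might still
        # turn into a stop word once the next chunk arrives.
        if keep == 0:
            yield from buffer
            buffer = ""
        elif len(buffer) > keep:
            yield from buffer[:-keep]
            buffer = buffer[-keep:]
    yield from buffer             # stream exhausted with no stop word found


def extract_text_before_stopword(stream, stop_words):
    # Eager wrapper matching the usage example above.
    return "".join(stream_until_stopword(stream, stop_words))
```

Because the generator only ever holds one chunk plus at most `max_stop_len - 1` carried-over characters, memory stays bounded regardless of stream length, and single-character chunks work the same way as large ones.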