Tokenize (Python)
Part 1: Code Reading and Understanding
You are given two functions: tokenize and detokenize. Your task is to:
- Read and understand what the code is doing
- Identify problems with this tokenization approach
```python
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens


def detokenize(tokens, vocab: dict):
    text = ""
    reversed_vocab = {value: key for key, value in vocab.items()}
    for token in tokens:
        text += reversed_vocab[token]
    return text
```
Example usage:
```python
vocab = {
    "a": 1,
    "b": 2,
    "cd": 3,
}

tokens = tokenize("acdebe", vocab)
result = detokenize(tokens, vocab)
```
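To make the behavior concrete, here is a trace of the example call above (the functions are reproduced so the snippet runs standalone). Note how characters that never complete a match are silently dropped:

```python
# The two functions from the exercise, reproduced for a runnable trace.
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens

def detokenize(tokens, vocab: dict):
    reversed_vocab = {value: key for key, value in vocab.items()}
    return "".join(reversed_vocab[t] for t in tokens)

vocab = {"a": 1, "b": 2, "cd": 3}

# "a" matches immediately; then "c" fails but "cd" matches; then
# "e", "eb", "ebe" never match, so "b" is lost along with both "e"s.
print(tokenize("acdebe", vocab))                     # [1, 3]
print(detokenize(tokenize("acdebe", vocab), vocab))  # "acd"
```

The round trip yields `"acd"`, not `"acdebe"`: the tokenizer is lossy whenever the growing key never lands on a vocab entry.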
Your Task
- Explain what this code does and how it works
- Identify what problems exist with this implementation
- Discuss when and why this approach would fail
Key Points to Consider
- How does the algorithm decide what to tokenize?
- What happens if a character in the text is not covered by any key in vocab?
- What happens to the key variable if no match is ever found?
- Is this a greedy or optimal approach?
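On the last point, a minimal case illustrates why first-match greediness is not optimal. With the assumed vocabulary below, emitting "a" as soon as it matches prevents the longer token "ab" from ever being considered:

```python
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens

vocab = {"a": 1, "ab": 2}

# "a" is emitted the moment it matches, so the longer token "ab" is
# never tried; the leftover "b" matches nothing and is silently lost.
print(tokenize("ab", vocab))  # [1]
```

A longest-match (or globally optimal) strategy would return `[2]` here; first-match returns `[1]` and drops a character that the vocabulary could in fact cover.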
Part 2: Code Review
You are now shown a modified version where another engineer attempted to fix the problem:
```python
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    # NEW: Handle remaining unmatched characters
    if key:
        tokens.append(vocab.get("UNK", -1))
    return tokens


vocab = {
    "a": 1,
    "b": 2,
    "cd": 3,
    "UNK": -1,  # NEW: Added UNK token
}
```
The changes made:
- Added logic after the for loop to handle the case when key is not empty
- Added "UNK": -1 to the vocabulary to represent unknown tokens
Your Task
Perform a code review on this change. Provide comments, questions, and concerns about this implementation.
Points to Consider
- What is the purpose of UNK?
- What is the expected behavior of this change?
- Does this fix actually solve the problem? Why or why not?
- What edge cases might still fail?
- Are there any conceptual or implementation issues?
Example Edge Cases to Think About
- What if the text is "xabc" where "x" is not in vocab?
- What if the text literally contains the string "UNK"? How would you distinguish between an unknown character being replaced by UNK vs the literal string "UNK" in the original text?
- Does the matching strategy always produce optimal results?
Part 3: Implementation
Now implement your own version of the tokenize function that properly handles characters that cannot be covered by the vocabulary.
Requirements
Your implementation should:
- Handle unknown characters — gracefully deal with text that contains characters not in vocab
- Maintain correctness — ensure both tokenize and detokenize still work properly
- Use UNK token — represent unknown/unmatched characters with a special token (e.g., UNK: -1)
- Note on round-trip limitations — be aware that if the original text contains the literal string "UNK", it cannot be distinguished from unknown characters after tokenization. This means perfect round-trip preservation is not always possible with this approach.
Key Challenges
- When should you decide that a character sequence cannot be matched?
- How do you detect when you're building a key that will never match anything in vocab?
- Should you check if any vocab key starts with the current key being built?
- How do you balance between trying to match longer sequences vs giving up early?
Note for Candidate
Background context: In natural language processing, tokenization converts text into discrete tokens (words, subwords, or characters) that can be mapped to numerical IDs. A vocabulary (vocab) defines the valid tokens.
- UNK token: Stands for "UNKNOWN" - a special token used to represent text that doesn't appear in the vocabulary
- Greedy matching: The original algorithm uses a greedy first-match strategy - it takes the first valid match it encounters while building up characters one at a time
- You may need to check if the current key is a prefix of any vocab key to determine if you should continue building or give up
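For reference, one possible shape of a solution is sketched below: a greedy longest-match tokenizer that emits one UNK per uncovered character and then resynchronizes at the next position. This is an illustrative sketch, not the expected answer; the `unk` parameter default and the inner longest-first scan are assumptions, and a real implementation might use a prefix check or trie instead of rescanning.

```python
def tokenize(text: str, vocab: dict, unk: int = -1):
    """Greedy longest-match tokenizer with a per-character UNK fallback."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate starting at i first, so "cd" beats "c".
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is not None:
            tokens.append(vocab[match])
            i += len(match)
        else:
            # No vocab entry starts at position i: emit one UNK and move on
            # one character, so a bad character cannot poison the rest.
            tokens.append(unk)
            i += 1
    return tokens

vocab = {"a": 1, "b": 2, "cd": 3, "UNK": -1}
print(tokenize("xabc", vocab))    # [-1, 1, 2, -1]
print(tokenize("acdebe", vocab))  # [1, 3, -1, 2, -1]
```

The longest-first rescan makes each position O(n), so the whole pass is O(n²) in the worst case; checking whether the current key is a prefix of any vocab key (e.g., via a trie or a precomputed prefix set) would let you stop extending early and bring this down considerably.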