Tokenize (Python)
Part 1: Code Reading and Understanding
You are given two functions: tokenize and detokenize. Your task is to:
- Read and understand what the code is doing
- Identify problems with this tokenization approach
```python
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens


def detokenize(tokens, vocab: dict):
    text = ""
    reversed_vocab = {value: key for key, value in vocab.items()}
    for token in tokens:
        text += reversed_vocab[token]
    return text
```
Example usage:
```python
vocab = {
    "a": 1,
    "b": 2,
    "cd": 3,
}

tokens = tokenize("acdebe", vocab)
result = detokenize(tokens, vocab)
```
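To make the behavior concrete, here is a trace of the example call above (the functions are reproduced so the snippet runs standalone). Note how characters that never complete a match are silently dropped:

```python
# The two functions from the exercise, reproduced for a runnable trace.
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens

def detokenize(tokens, vocab: dict):
    reversed_vocab = {value: key for key, value in vocab.items()}
    return "".join(reversed_vocab[t] for t in tokens)

vocab = {"a": 1, "b": 2, "cd": 3}

# "a" matches immediately; then "c" fails but "cd" matches; then
# "e", "eb", "ebe" never match, so "b" is lost along with both "e"s.
print(tokenize("acdebe", vocab))                     # [1, 3]
print(detokenize(tokenize("acdebe", vocab), vocab))  # "acd"
```

The round trip yields `"acd"`, not `"acdebe"`: the tokenizer is lossy whenever the growing key never lands on a vocab entry.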
Your Task
- Explain what this code does and how it works
- Identify what problems exist with this implementation
- Discuss when and why this approach would fail
Key Points to Consider
- How does the algorithm decide what to tokenize?
- What happens if a character in the text is not covered by any key in vocab?
- What happens to the key variable if no match is ever found?
- Is this a greedy or optimal approach?
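On the last point, a minimal case illustrates why first-match greediness is not optimal. With the assumed vocabulary below, emitting "a" as soon as it matches prevents the longer token "ab" from ever being considered:

```python
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens

vocab = {"a": 1, "ab": 2}

# "a" is emitted the moment it matches, so the longer token "ab" is
# never tried; the leftover "b" matches nothing and is silently lost.
print(tokenize("ab", vocab))  # [1]
```

A longest-match (or globally optimal) strategy would return `[2]` here; first-match returns `[1]` and drops a character that the vocabulary could in fact cover.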
Part 2: Code Review
You are now shown a modified version where another engineer attempted to fix the problem:
```python
def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    # NEW: Handle remaining unmatched characters
    if key:
        tokens.append(vocab.get("UNK", -1))
    return tokens


vocab = {
    "a": 1,
    "b": 2,
    "cd": 3,
    "UNK": -1,  # NEW: Added UNK token
}
```
The changes made:
- Added logic after the for loop to handle the case when key is not empty
- Added "UNK": -1 to the vocabulary to represent unknown tokens
Your Task
Perform a code review on this change. Provide comments, questions, and concerns about this implementation.
Points to Consider
- What is the purpose of UNK?
- What is the expected behavior of this change?
- Does this fix actually solve the problem? Why or why not?
- What edge cases might still fail?
- Are there any conceptual or implementation issues?
Example Edge Cases to Think About
- What if the text is "xabc" where "x" is not in vocab?
- What if the text literally contains the string "UNK"? How would you distinguish between an unknown character being replaced by UNK vs the literal string "UNK" in the original text?
- Does the matching strategy always produce optimal results?
Part 3: Implementation
Now implement your own version of the tokenize function that properly handles characters that cannot be covered by the vocabulary.
Requirements
Your implementation should:
- Handle unknown characters — gracefully deal with text that contains characters not in vocab
- Maintain correctness — ensure both tokenize and detokenize still work properly
- Use UNK token — represent unknown/unmatched characters with a special token (e.g., UNK: -1)
- Note on round-trip limitations — be aware that if the original text contains the literal string "UNK", it cannot be distinguished from unknown characters after tokenization. This means perfect round-trip preservation is not always possible with this approach.
Key Challenges
- When should you decide that a character sequence cannot be matched?
- How do you detect when you're building a key that will never match anything in vocab?
- Should you check if any vocab key starts with the current key being built?
- How do you balance between trying to match longer sequences vs giving up early?
Note for Candidate
Background context: In natural language processing, tokenization converts text into discrete tokens (words, subwords, or characters) that can be mapped to numerical IDs. A vocabulary (vocab) defines the valid tokens.
- UNK token: Stands for "UNKNOWN" - a special token used to represent text that doesn't appear in the vocabulary
- Greedy matching: The original algorithm uses a greedy first-match strategy - it takes the first valid match it encounters while building up characters one at a time
- You may need to check if the current key is a prefix of any vocab key to determine if you should continue building or give up
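For reference, one possible shape of a solution is sketched below: a greedy longest-match tokenizer that emits one UNK per uncovered character and then resynchronizes at the next position. This is an illustrative sketch, not the expected answer; the `unk` parameter default and the inner longest-first scan are assumptions, and a real implementation might use a prefix check or trie instead of rescanning.

```python
def tokenize(text: str, vocab: dict, unk: int = -1):
    """Greedy longest-match tokenizer with a per-character UNK fallback."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate starting at i first, so "cd" beats "c".
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is not None:
            tokens.append(vocab[match])
            i += len(match)
        else:
            # No vocab entry starts at position i: emit one UNK and move on
            # one character, so a bad character cannot poison the rest.
            tokens.append(unk)
            i += 1
    return tokens

vocab = {"a": 1, "b": 2, "cd": 3, "UNK": -1}
print(tokenize("xabc", vocab))    # [-1, 1, 2, -1]
print(tokenize("acdebe", vocab))  # [1, 3, -1, 2, -1]
```

The longest-first rescan makes each position O(n), so the whole pass is O(n²) in the worst case; checking whether the current key is a prefix of any vocab key (e.g., via a trie or a precomputed prefix set) would let you stop extending early and bring this down considerably.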