Implement a tokenizer that converts a string into a list of integer token IDs using a provided vocabulary dictionary. The tokenizer must apply a greedy longest-match strategy: at each position, scan ahead and select the longest substring starting there that exists in the vocabulary. If no vocabulary entry matches at the current position, emit a special UNK token (use the value -1) and advance by exactly one character. You will write two functions:
tokenize(text: str, vocab: dict[str, int]) -> list[int]
detokenize(tokens: list[int], vocab: dict[str, int]) -> str
The vocabulary dict maps literal string pieces to their integer IDs. detokenize reverses the process by concatenating the string pieces corresponding to the supplied IDs. You may assume the vocabulary contains every ID that tokenize can return (including the entry "UNK" mapped to -1, used when detokenizing UNK tokens). Your implementation should be efficient enough to handle input texts up to 10^5 characters and vocabularies up to 10^5 entries.
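One possible solution sketch, in Python. Bounding the scan-ahead by the longest vocabulary key is an implementation choice (not required by the spec); it keeps the greedy loop near O(n × max_piece_length) substring lookups, which is comfortably fast for the stated limits when pieces are short:

```python
def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # The longest vocabulary piece bounds how far we need to look ahead.
    max_len = max(map(len, vocab), default=0)
    tokens: list[int] = []
    i, n = 0, len(text)
    while i < n:
        # Greedy longest match: try the longest candidate first.
        for j in range(min(n, i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            # No piece matches at position i: emit UNK, advance one char.
            tokens.append(-1)
            i += 1
    return tokens


def detokenize(tokens: list[int], vocab: dict[str, int]) -> str:
    # Invert the vocabulary once; per the spec, -1 maps to "UNK".
    id_to_piece = {v: k for k, v in vocab.items()}
    id_to_piece.setdefault(-1, "UNK")
    return "".join(id_to_piece[t] for t in tokens)
```

For example, with vocab = {"hel": 1, "hello": 2, " ": 3, "world": 4, "UNK": -1}, tokenize("hello world", vocab) prefers "hello" over the shorter prefix "hel", and tokenize("helz", vocab) falls back to "hel" followed by an UNK for "z".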