This is a multi-part online assessment focused on implementing and optimizing a byte-level tokenizer similar to Byte Pair Encoding (BPE), widely used in LLM tokenization. The assessment tests algorithm comprehension, performance optimization, and statistical estimation.
The assessment consists of three parts:
Part 0: Study a reference implementation and describe the algorithm
Part 1: Implement a faster version that produces identical output
Part 2: Estimate token counts when full tokenization is too expensive
Read the slow_tokenize implementation carefully to understand the algorithm
Your tokenize method must return identical results to slow_tokenize
Target performance: tokenize ~500 KB of text with a 10,000+ token vocabulary in under 10 seconds
Study the slow_tokenize method and write a precise natural language description of the tokenization algorithm.
The tokenization algorithm converts a byte sequence into a list of token IDs using a two-step process:
Step 1 — Initialization: Convert the input byte sequence into a list of token IDs by treating each byte as its own single-byte token. Since the first 256 entries of token_alphabet are guaranteed to be bytes([i]) for i in 0..255, the byte value itself is the token ID.
Step 2 — Iterative Merging: Process each multi-byte token in the alphabet in order of ascending token ID (from 256 to the end). For each token:
Determine mergeable pairs: For the current token's byte sequence, try every possible split point. Each split produces left and right byte subsequences. If both exist as tokens, then their token ID pair is a valid "merge pair".
Apply merges left-to-right: Scan through the current token ID list. At each position i, check whether (token_ids[i], token_ids[i+1]) is in the merge pair set. If yes, append the current token's ID and skip both positions. Otherwise, copy token_ids[i] unchanged.
A single multi-byte token can have multiple valid merge pairs (different split points)
Merges are greedy left-to-right: after merging the pair at positions i and i+1, scanning resumes at i+2, so the newly merged token cannot participate in another merge during the same pass
Tokens are processed in strict ascending ID order; earlier merges change the list before later tokens are considered
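The description above can be condensed into a minimal, self-contained sketch of the slow algorithm (the function name and the explicit `token_alphabet` parameter are illustrative; the real assessment class holds these as attributes):

```python
def slow_tokenize_sketch(data: bytes, token_alphabet: list[bytes]) -> list[int]:
    token_to_id = {tok: i for i, tok in enumerate(token_alphabet)}
    # Step 1: each byte is its own token; the byte value is the token ID.
    token_ids = list(data)
    # Step 2: process multi-byte tokens in ascending ID order.
    for token_id in range(256, len(token_alphabet)):
        token_bytes = token_alphabet[token_id]
        # Every split point whose halves are both tokens yields a merge pair.
        merge_pairs = set()
        for split in range(1, len(token_bytes)):
            left = token_to_id.get(token_bytes[:split])
            right = token_to_id.get(token_bytes[split:])
            if left is not None and right is not None:
                merge_pairs.add((left, right))
        # Greedy left-to-right merge pass.
        merged, i = [], 0
        while i < len(token_ids):
            if i + 1 < len(token_ids) and (token_ids[i], token_ids[i + 1]) in merge_pairs:
                merged.append(token_id)
                i += 2
            else:
                merged.append(token_ids[i])
                i += 1
        token_ids = merged
    return token_ids
```

With a toy vocabulary of the 256 byte tokens plus `b"ab"` (ID 256) and `b"abc"` (ID 257), the input `b"abcab"` tokenizes to `[257, 256]`: the first pass merges both `ab` pairs, and the second pass merges `ab` + `c`.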
Implement the tokenize method that returns identical results to slow_tokenize but runs efficiently.
The reference implementation has time complexity O(V × N) where V is vocabulary size and N is text length. For each of ~10,000 multi-byte tokens, it scans the entire token list — wasteful when most tokens have zero matching pairs.
Instead of iterating over every token in the vocabulary:
Precompute merge rules: For each pair of adjacent token IDs, determine what token they merge into and at what priority (the target token's ID)
Use a min-heap: Track all mergeable adjacent pairs with their target token ID as priority. Process the lowest-priority (earliest) merge first
Use a doubly-linked list: Enable O(1) merge operations instead of rebuilding the entire list
Preprocess the token_alphabet in init to build merge_rules dictionary
Implement tokenize using priority queue and linked list approach
Return identical results to slow_tokenize for all inputs
Hint: Precomputing Merge Rules
Build a dictionary mapping (left_id, right_id) pairs to the target token ID they merge into:

```python
self.merge_rules: dict[tuple[int, int], int] = {}
for token_id in range(256, len(self.token_alphabet)):
    token_bytes = self.token_alphabet[token_id]
    for split_point in range(1, len(token_bytes)):
        left_bytes = token_bytes[:split_point]
        right_bytes = token_bytes[split_point:]
        left_id = self.token_to_id.get(left_bytes)
        right_id = self.token_to_id.get(right_bytes)
        if left_id is not None and right_id is not None:
            self.merge_rules[(left_id, right_id)] = token_id
```
Hint: Priority Queue Algorithm
Use a heap to process merges in ascending target token ID order:

```python
heap = []
for i in range(len(token_ids) - 1):
    pair = (token_ids[i], token_ids[i + 1])
    if pair in self.merge_rules:
        heapq.heappush(heap, (self.merge_rules[pair], i))

while heap:
    target_id, pos = heapq.heappop(heap)
    # Validate merge is still possible
    # Perform merge: update token_ids[pos], mark next position as deleted
    # Update linked list and push new adjacent pairs to heap
```
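One way to fill in the validation and linked-list bookkeeping is sketched below. This is not the only correct implementation: `fast_tokenize_sketch` is an illustrative name, and `merge_rules` is assumed to be the dictionary built in the precomputation hint. Array-based `prv`/`nxt` pointers serve as the doubly-linked list, and stale heap entries are detected at pop time by re-checking the pair:

```python
import heapq

def fast_tokenize_sketch(data: bytes, merge_rules: dict[tuple[int, int], int]) -> list[int]:
    n = len(data)
    ids = list(data)                 # current token ID at each live position
    nxt = list(range(1, n)) + [-1]   # doubly-linked list over positions
    prv = [-1] + list(range(n - 1))
    alive = [True] * n
    heap = []
    for i in range(n - 1):
        pair = (ids[i], ids[i + 1])
        if pair in merge_rules:
            heapq.heappush(heap, (merge_rules[pair], i))
    while heap:
        target, pos = heapq.heappop(heap)
        # Skip stale entries: position deleted, or the pair changed since push.
        if not alive[pos]:
            continue
        j = nxt[pos]
        if j == -1 or merge_rules.get((ids[pos], ids[j])) != target:
            continue
        # Merge: pos absorbs its right neighbour j.
        ids[pos] = target
        alive[j] = False
        nxt[pos] = nxt[j]
        if nxt[j] != -1:
            prv[nxt[j]] = pos
        # The merged token may form new mergeable pairs with its neighbours.
        p = prv[pos]
        if p != -1 and (ids[p], ids[pos]) in merge_rules:
            heapq.heappush(heap, (merge_rules[(ids[p], ids[pos])], p))
        q = nxt[pos]
        if q != -1 and (ids[pos], ids[q]) in merge_rules:
            heapq.heappush(heap, (merge_rules[(ids[pos], ids[q])], pos))
    return [ids[i] for i in range(n) if alive[i]]
```

Ties on the same target ID pop in ascending position order (the heap keys are `(target, pos)` tuples), which reproduces the reference implementation's left-to-right greedy order within each token's pass.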
Implement estimate_token_count that estimates len(self.tokenize(text)) without tokenizing the entire text.
Total bytes passed to self.tokenize across all calls must not exceed sample_size
If sample_size >= len(text), return the exact count
Accuracy target: within 20% for any sample size, within 5% for sample sizes greater than 10,000
The key observation is that the tokens-per-byte ratio tends to be consistent across a document. We can:
Divide the text into segments
Sample a chunk from each segment (stratified sampling)
Tokenize each chunk and compute the local tokens-per-byte ratio
Extrapolate to the full text
Why stratified over pure random? Stratified sampling ensures coverage across the entire document, reducing variance — especially important for heterogeneous text.
Use at least 4 samples for coverage
Use chunks of at least 250 bytes
Divide the budget across segments evenly
Use the provided rng for any randomization
Hint: Stratified Sampling Strategy

```python
chunk_size = max(250, min(sample_size // 4, len(text) // 10))
num_samples = sample_size // chunk_size
if num_samples < 1:
    num_samples = 1
    chunk_size = sample_size
segment_size = len(text) // num_samples

for i in range(num_samples):
    seg_start = i * segment_size
    seg_end = min(seg_start + segment_size, len(text))
    # Pick random chunk within segment
    # Tokenize and accumulate ratio
```
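The hint above can be completed into a full sketch. The function name and the explicit `tokenize` callback are illustrative (the real solution would call `self.tokenize`); the chunk-placement and extrapolation details are one reasonable choice, not the only one:

```python
import random

def estimate_token_count_sketch(text: bytes, sample_size: int,
                                rng: random.Random, tokenize) -> int:
    # Exact count when the budget covers the whole text.
    if sample_size >= len(text):
        return len(tokenize(text))
    # Stratified sampling: one chunk per segment, budget split evenly.
    chunk_size = max(250, min(sample_size // 4, len(text) // 10))
    num_samples = max(1, sample_size // chunk_size)
    chunk_size = min(chunk_size, sample_size // num_samples)
    segment_size = len(text) // num_samples
    sampled_tokens = sampled_bytes = 0
    for i in range(num_samples):
        seg_start = i * segment_size
        seg_end = min(seg_start + segment_size, len(text))
        # Pick a uniformly random chunk start within the segment.
        start = rng.randint(seg_start, max(seg_start, seg_end - chunk_size))
        chunk = text[start:start + chunk_size]
        sampled_tokens += len(tokenize(chunk))
        sampled_bytes += len(chunk)
    # Extrapolate the observed tokens-per-byte ratio to the full text.
    return round(len(text) * sampled_tokens / sampled_bytes)
```

Note the budget invariant: `num_samples * chunk_size <= sample_size` holds after the recompute on the third line, so the total bytes passed to `tokenize` never exceed `sample_size`.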
Why this works: The tokens-per-byte ratio is relatively stable within a document. Boundary effects (tokens that would span chunk boundaries) contribute less than 1% error with chunks of 250+ bytes.
Accuracy by sample size:
sample_size = 1000: 4 chunks of 250 bytes, ~20% accuracy
sample_size = 10000: 4 chunks of 2500 bytes, less than 5% error
sample_size = 50000: 4 chunks of 12500 bytes, less than 1% error
The ascending token ID order corresponds to the order in which tokens were learned during BPE training. Token 256 was the first merge learned (most frequent byte pair), token 257 was the second, etc. Applying merges in this order ensures deterministic tokenization matching training.
Production tokenizers (tiktoken, SentencePiece, HuggingFace tokenizers) use optimized versions:
Written in Rust/C++ for performance
Hash maps for O(1) pair lookups
Regex-based pre-tokenization (splitting on whitespace, punctuation)
Parallel processing of independent text chunks
The priority queue approach in Part 1 matches the algorithmic complexity of these implementations.
Overlap correction: Tokenize overlapping chunks and compare overlap regions
Weighted sampling: Sample more from regions with higher variance
Byte-frequency model: Build a model mapping byte n-gram frequencies to token counts
Adaptive chunk sizing: Start with large chunks, then sample more from differing regions
Time complexity: O(N log N) where N is the text length
Space complexity: O(N) for linked list and heap