This is a multi-part online assessment focused on implementing and optimizing a byte-level tokenizer similar to Byte Pair Encoding (BPE), widely used in LLM tokenization. The assessment tests algorithm comprehension, performance optimization, and statistical estimation.
The assessment consists of three parts:
Study the slow_tokenize method and write a precise natural language description of the tokenization algorithm.
The tokenization algorithm converts a byte sequence into a list of token IDs using a two-step process:
Step 1 — Initialization: Convert the input byte sequence into a list of token IDs by treating each byte as its own single-byte token. Since the first 256 entries of token_alphabet are guaranteed to be bytes([i]) for i in 0..255, the byte value itself is the token ID.
Step 2 — Iterative Merging: Process each multi-byte token in the alphabet in order of ascending token ID (from 256 to the end). For each token, scan the current token list from left to right; wherever the bytes of two adjacent tokens concatenate to exactly this token's bytes, replace the pair with this token's ID and continue scanning after the merged token.
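The slow_tokenize source itself is not reproduced in this excerpt; the two steps above correspond to a sketch along these lines (the free-function signature is an assumption — the original is a method):

```python
def slow_tokenize(token_alphabet: list[bytes], data: bytes) -> list[int]:
    # Step 1: one token per byte; the byte value is the token ID, because
    # the first 256 alphabet entries are bytes([i]) for i in 0..255.
    token_ids = list(data)
    # Step 2: apply each multi-byte token in ascending token ID order.
    for token_id in range(256, len(token_alphabet)):
        token_bytes = token_alphabet[token_id]
        merged: list[int] = []
        i = 0
        while i < len(token_ids):
            # Merge any adjacent pair whose bytes concatenate to this token.
            if (i + 1 < len(token_ids)
                    and token_alphabet[token_ids[i]]
                    + token_alphabet[token_ids[i + 1]] == token_bytes):
                merged.append(token_id)
                i += 2
            else:
                merged.append(token_ids[i])
                i += 1
        token_ids = merged
    return token_ids
```

For example, with a vocabulary extended by 256 = b"ab" and 257 = b"abc", the input b"abcab" first becomes [256, 99, 256] and then [257, 256].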
Implement the tokenize method that returns identical results to slow_tokenize but runs efficiently.
The reference implementation has time complexity O(V × N) where V is vocabulary size and N is text length. For each of ~10,000 multi-byte tokens, it scans the entire token list — wasteful when most tokens have zero matching pairs.
Instead of iterating over every token in the vocabulary:
- Precompute a merge-rules dictionary mapping each mergeable adjacent (left_id, right_id) pair to the token ID it merges into.
- Keep the text's positions in a doubly linked list and process candidate merges from a min-heap keyed by target token ID.
This reduces complexity to O(N log N).
Hint: Precomputing Merge Rules

Build a dictionary mapping (left_id, right_id) pairs to the target token they merge into:

```python
# Map each mergeable adjacent pair of token IDs to the token it merges into.
self.merge_rules: dict[tuple[int, int], int] = {}
for token_id in range(256, len(self.token_alphabet)):
    token_bytes = self.token_alphabet[token_id]
    for split_point in range(1, len(token_bytes)):
        left_bytes = token_bytes[:split_point]
        right_bytes = token_bytes[split_point:]
        left_id = self.token_to_id.get(left_bytes)
        right_id = self.token_to_id.get(right_bytes)
        if left_id is not None and right_id is not None:
            self.merge_rules[(left_id, right_id)] = token_id
```
Hint: Priority Queue Algorithm

Use a min-heap to process merges in ascending target token ID order:

```python
# Initialize the heap with every mergeable adjacent pair.
heap = []
for i in range(len(token_ids) - 1):
    pair = (token_ids[i], token_ids[i + 1])
    if pair in self.merge_rules:
        heapq.heappush(heap, (self.merge_rules[pair], i))

while heap:
    target_id, pos = heapq.heappop(heap)
    # Validate that the merge at `pos` is still possible
    # Perform merge: update token_ids[pos], mark the next position as deleted
    # Update the linked list and push newly adjacent pairs onto the heap
```
Full Solution (Part 1)

Time complexity: O(N log N), where N is the text length.
Space complexity: O(N) for the linked list and heap.
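The full solution body is elided above; combining the two hints, a sketch might look like the following (written as a free function over a precomputed merge_rules dict rather than a method, for self-containment):

```python
import heapq

def tokenize(merge_rules: dict[tuple[int, int], int], data: bytes) -> list[int]:
    # merge_rules maps an adjacent (left_id, right_id) pair to the token ID
    # it merges into (see the precomputation hint above).
    n = len(data)
    ids = list(data)                    # current token ID at each position
    nxt = list(range(1, n)) + [-1]      # doubly linked list over positions
    prv = [-1] + list(range(n - 1))
    alive = [True] * n
    heap: list[tuple[int, int]] = []
    for i in range(n - 1):
        pair = (ids[i], ids[i + 1])
        if pair in merge_rules:
            heapq.heappush(heap, (merge_rules[pair], i))
    while heap:
        target, i = heapq.heappop(heap)
        if not alive[i]:
            continue
        j = nxt[i]
        # Validate: the entry may be stale if either side was merged already.
        if j == -1 or merge_rules.get((ids[i], ids[j])) != target:
            continue
        ids[i] = target                 # perform the merge in place
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prv[nxt[j]] = i
        # Push any new mergeable pairs formed with the neighbours.
        p, q = prv[i], nxt[i]
        if p != -1 and (ids[p], ids[i]) in merge_rules:
            heapq.heappush(heap, (merge_rules[(ids[p], ids[i])], p))
        if q != -1 and (ids[i], ids[q]) in merge_rules:
            heapq.heappush(heap, (merge_rules[(ids[i], ids[q])], i))
    return [ids[i] for i in range(n) if alive[i]]
```

The heap's (target_id, position) ordering applies lower-ID merges first and breaks ties left to right, mirroring the sequential passes of slow_tokenize.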
Implement estimate_token_count that estimates len(self.tokenize(text)) without tokenizing the entire text.
The key observation is that the tokens-per-byte ratio tends to be consistent across a document. We can therefore tokenize a handful of sampled chunks, measure their combined tokens-per-byte ratio, and extrapolate that ratio to the full text length.
Why stratified over pure random? Stratified sampling ensures coverage across the entire document, reducing variance — especially important for heterogeneous text.
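The variance argument can be seen on a toy example (not part of the assessment code): estimating the mean of a sharply heterogeneous sequence from two samples, with pure random draws versus one stratified draw per half.

```python
import random
import statistics

# Front half differs sharply from back half, like heterogeneous text.
data = [0.0] * 500 + [1.0] * 500
rng = random.Random(42)

def random_estimate() -> float:
    # Two draws anywhere in the sequence.
    return statistics.mean(data[rng.randrange(len(data))] for _ in range(2))

def stratified_estimate() -> float:
    # One draw from each half guarantees both regions are represented.
    return statistics.mean([data[rng.randrange(0, 500)],
                            data[rng.randrange(500, 1000)]])

random_error = statistics.mean(
    abs(random_estimate() - 0.5) for _ in range(1000))
stratified_error = statistics.mean(
    abs(stratified_estimate() - 0.5) for _ in range(1000))
# stratified_error is exactly 0 here; random_error is not, because pure
# random draws often land twice in the same half.
```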
Hint: Stratified Sampling Strategy

```python
# Determine chunk size and number of samples.
chunk_size = max(250, min(sample_size // 4, len(text) // 10))
num_samples = sample_size // chunk_size
if num_samples < 1:
    num_samples = 1
    chunk_size = sample_size

segment_size = len(text) // num_samples

# Sample one chunk from each segment.
for i in range(num_samples):
    seg_start = i * segment_size
    seg_end = min(seg_start + segment_size, len(text))
    # Pick a random chunk within the segment
    # Tokenize and accumulate the ratio
```
Full Solution (Part 2)

Why this works: the tokens-per-byte ratio is relatively stable within a document. Boundary effects (tokens that would span chunk boundaries) contribute less than 1% error with chunks of 250+ bytes.
Accuracy by sample size:
- sample_size = 1000: 4 chunks of 250 bytes, roughly 20% error
- sample_size = 10000: 4 chunks of 2500 bytes, less than 5% error
- sample_size = 50000: 4 chunks of 12500 bytes, less than 1% error
The ascending token ID order corresponds to the order in which tokens were learned during BPE training. Token 256 was the first merge learned (most frequent byte pair), token 257 was the second, etc. Applying merges in this order ensures deterministic tokenization matching training.
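A tiny worked example makes the order-dependence concrete. Assume a hypothetical two-token vocabulary where token 256 = b"aa" was learned first and token 257 = b"ab" second; on the input b"aab", the two orders produce different token sequences:

```python
def apply_merge(ids: list[int], left: int, right: int, new_id: int) -> list[int]:
    # Replace every adjacent (left, right) pair with new_id, left to right.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (left, right):
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aab")                                        # [97, 97, 98]
ascending = apply_merge(apply_merge(ids, 97, 97, 256), 97, 98, 257)
reversed_ = apply_merge(apply_merge(ids, 97, 98, 257), 97, 97, 256)
# ascending -> [256, 98]; reversed_ -> [97, 257]: the same byte string
# tokenizes to different IDs when the merge order deviates from training.
```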
Production tokenizers (tiktoken, SentencePiece, HuggingFace tokenizers) ship similarly optimized implementations.
The priority queue approach in Part 1 matches the algorithmic complexity of these implementations.
More sophisticated approaches:
```python
def estimate_token_count(
    self,
    text: bytes,
    sample_size: int,
    rng: random.Random,
) -> int:
    if sample_size >= len(text):
        return len(self.tokenize(text))
    # Determine chunk size and number of samples.
    chunk_size = max(250, min(sample_size // 4, len(text) // 10))
    num_samples = sample_size // chunk_size
    if num_samples < 1:
        num_samples = 1
        chunk_size = sample_size
    segment_size = len(text) // num_samples
    total_tokens = 0
    total_bytes = 0
    for i in range(num_samples):
        # Pick a random chunk within the i-th segment.
        seg_start = i * segment_size
        seg_end = min(seg_start + segment_size, len(text))
        max_start = max(seg_start, seg_end - chunk_size)
        start = rng.randint(seg_start, max_start)
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        tokens = self.tokenize(chunk)
        total_tokens += len(tokens)
        total_bytes += len(chunk)
    # Extrapolate the measured tokens-per-byte ratio to the whole text.
    ratio = total_tokens / total_bytes
    return round(ratio * len(text))
```
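As a usage sketch, the estimator can be exercised with a stub tokenizer. The class below is invented purely for illustration (its tokenize merges every pair of identical adjacent bytes, giving a content-dependent tokens-per-byte ratio); estimate_token_count repeats the solution logic above so the example is self-contained.

```python
import random

class StubTokenizer:
    """Toy stand-in for the assessment's tokenizer (hypothetical)."""

    def tokenize(self, text: bytes) -> list[int]:
        # Merge each pair of identical adjacent bytes into one token.
        out, i = [], 0
        while i < len(text):
            if i + 1 < len(text) and text[i] == text[i + 1]:
                out.append(256 + text[i])
                i += 2
            else:
                out.append(text[i])
                i += 1
        return out

    def estimate_token_count(self, text: bytes, sample_size: int,
                             rng: random.Random) -> int:
        # Same stratified-sampling logic as the full solution above.
        if sample_size >= len(text):
            return len(self.tokenize(text))
        chunk_size = max(250, min(sample_size // 4, len(text) // 10))
        num_samples = sample_size // chunk_size
        if num_samples < 1:
            num_samples = 1
            chunk_size = sample_size
        segment_size = len(text) // num_samples
        total_tokens = total_bytes = 0
        for i in range(num_samples):
            seg_start = i * segment_size
            seg_end = min(seg_start + segment_size, len(text))
            max_start = max(seg_start, seg_end - chunk_size)
            start = rng.randint(seg_start, max_start)
            chunk = text[start:min(start + chunk_size, len(text))]
            total_tokens += len(self.tokenize(chunk))
            total_bytes += len(chunk)
        return round(total_tokens / total_bytes * len(text))

tok = StubTokenizer()
text = b"aabb" * 25_000     # 100,000 bytes; exactly 50,000 tokens when aligned
estimate = tok.estimate_token_count(text, sample_size=10_000,
                                    rng=random.Random(0))
# estimate lands within a fraction of a percent of the true count of 50,000.
```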