Level: Senior-Level
Round: Online Assessment · Type: Coding · Difficulty: 7/10 · Duration: 60 min · Interviewer: Neutral
Topics: Tokenization, Algorithms, Optimization, Randomization, Estimation
Location: San Francisco Bay Area
Interview date: 2026-01-20
I had an online assessment that involved implementing a byte tokenizer and estimating token counts.
This online assessment consisted of three parts:
Open the byte_tokenizer.py file and carefully study the starter code.
The starter code contains a ByteTokenizer class with a reference implementation of a tokenization algorithm in the slow_tokenize method.
You should treat the reference implementation's behavior as authoritative, even if you think some other behavior is preferable.
In particular, you should not assume any particular input constraint unless that constraint follows from the problem specification or the code/comments of the reference implementation.
NOTE: even if you're familiar with LLM tokenization, you should still carefully study the code in byte_tokenizer.py as there may be subtle differences from the implementations that you've seen before.
We recommend spending at least 10 minutes studying and understanding the reference implementation, as misunderstandings will greatly slow you down once you start coding in Parts 1 and 2.
Once you feel you've thoroughly understood the reference implementation, please open part0_description.md and write a natural language (English) description of how the tokenization algorithm works.
Your description should be precise and detailed enough that a smart computer science student could reimplement the algorithm from scratch.
Once you feel satisfied with your description, you may proceed to the next part.
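The post does not include the actual starter code, so the following is only an illustrative guess at what a reference implementation of this kind might look like. The class name ByteTokenizer and method name slow_tokenize come from the post; everything else, including the greedy longest-match rule and the assumption that the vocabulary contains every single byte, is an assumption, and the note above warns that the real algorithm may differ subtly.

```python
class ByteTokenizer:
    """Hypothetical sketch of the reference class; the real starter code may differ."""

    def __init__(self, vocab):
        # vocab: a list of bytes objects; assumed to contain every single
        # byte so that tokenization can never get stuck (an assumption).
        self.vocab = list(vocab)

    def slow_tokenize(self, text):
        # Greedy longest-match: at each position, linearly scan the entire
        # vocabulary for the longest matching token -- O(len(text) * |vocab|),
        # which would explain why the reference method is "rather slow".
        tokens = []
        i = 0
        while i < len(text):
            best = None
            for tok in self.vocab:
                if text.startswith(tok, i) and (best is None or len(tok) > len(best)):
                    best = tok
            tokens.append(best)
            i += len(best)
        return tokens
```

Studying details like the tie-breaking rule (longest match here) and edge cases (empty input, overlapping tokens) is exactly what makes the Part 0 description precise enough to reimplement from.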
The reference implementation in slow_tokenize is, as its name might suggest, rather slow.
Please develop a faster implementation in the tokenize method.
The tokenize method should return exactly the same values as the slow_tokenize method; any difference in return value between the two is considered incorrect.
Ideally, you did Part 0 carefully and have a detailed understanding of precisely what slow_tokenize does.
However, if you realize that your understanding is incomplete or incorrect, you should feel free to go back and edit your Part 0 response in part0_description.md.
You may implement any preprocessing logic you would like in the constructor.
You may also implement any helper functions you would like.
Note that only code in byte_tokenizer.py will be graded; therefore, keep all helper functions within that file.
You may import modules from the standard library, but do not import third-party libraries.
You should strive to have the Part 1 tests run in 10 seconds or less (total time across test cases).
The Part 1 tests involve tokenizing a total of ~500 kilobytes of text with a token alphabet of 10000+ distinct tokens.
Achieving this will require both algorithmic improvements and reasonably efficient coding.
However, it should not require hyperoptimization (e.g., writing C++ bindings, bit manipulation, etc).
Faster runtimes are even better, but again, do not hyperoptimize.
Note that only the runtime of tokenize() is measured.
You may perform any preprocessing logic in the constructor and it will not be considered as part of the total runtime (of course, any preprocessing should run in a reasonable amount of time).
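One way to use that free preprocessing budget, assuming the reference algorithm is greedy longest-match (an assumption to verify against the actual slow_tokenize), is to build a byte trie in the constructor so that tokenize walks the trie once per position instead of rescanning the vocabulary. This is a sketch, not the known intended solution:

```python
class ByteTokenizer:
    def __init__(self, vocab):
        # Preprocessing in the constructor is not counted toward the measured
        # runtime, so build a byte trie here: each node maps a byte value to
        # a child node, and the None key marks a complete token.
        self.trie = {}
        for tok in vocab:
            node = self.trie
            for b in tok:
                node = node.setdefault(b, {})
            node[None] = tok

    def tokenize(self, text):
        # Greedy longest-match via a single trie walk per position:
        # O(len(text) * max_token_len) instead of O(len(text) * |vocab|).
        tokens = []
        i, n = 0, len(text)
        while i < n:
            node, best, j = self.trie, None, i
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                if None in node:
                    best = node[None]  # longest match seen so far
            tokens.append(best)        # assumes every single byte is in vocab
            i += len(best)
        return tokens
```

At ~500 KB of input and a 10000+ token alphabet, this kind of algorithmic change is the sort of improvement the 10-second budget seems to call for, without any hyperoptimization.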
You can test your code by clicking on the green dropdown menu in CoderPad and selecting:
Run Tests (Part 1)
Note that the private test set will be more comprehensive than the tests provided to you.
If you'd like, you may further test your tokenize method either by modifying test_part1.py or by using the scratchpad in scratchpad.py.
The Part 1 cases involve large strings.
For debugging, you may instead choose to run a simplified test suite by selecting:
Run Tests (Part 1 - simple)
However, once you are done debugging, you should use the full test suite to verify your code's correctness and speed.
Once you are satisfied with the correctness, code quality, and speed of your implementation, you may proceed to the next part.
Suggested time: 20-40 minutes.
One common task when working with text models is counting the number of tokens in a byte sequence.
This is especially important when working with large inputs, as we may need to limit the length of such inputs that are provided to a model.
The most straightforward way to count the number of tokens in a given byte sequence text is:
len(tokenizer.tokenize(text))
However, even with your efficient implementation from Part 1, it may not be possible to run tokenize across all inputs we care about.
Therefore, we would like to estimate token counts.
Within the ByteTokenizer class, please implement a new method with the following signature:
estimate_token_count(self, text, sample_size, rng)
The estimate_token_count method should estimate the number of tokens returned by:
self.tokenize(text)
The method is allowed to call self.tokenize an arbitrary number of times, as long as the total number of bytes passed into self.tokenize (summed across all calls) does not exceed sample_size.
You may assume that sample_size is at least 1000.
You may use randomization, but any randomization should be performed through the rng parameter, which is a random.Random object.
Other than calls to self.tokenize, your code should not implement any tokenizer-like functionality.
IMPORTANT: if sample_size >= len(text), then your code must return an exact token count.
As in Part 1, only code in byte_tokenizer.py will be graded so you should make sure any logic (including helper functions) is contained within that file.
You may import modules from the standard library, but do not import third-party libraries.
For text that's larger than the sample size, your code should be able to estimate the token count within 20% of the true value.
For sample sizes exceeding 10000, the accuracy should be within 5% of the true value.
Meeting these expectations suffices to pass this online assessment (simple approaches are fine).
If you would like, you can treat Part 2 as a more open-ended exercise by exploring additional techniques to further improve your code's accuracy.
You can test your code by clicking on the green dropdown menu in CoderPad and selecting:
Run Tests (Part 2)
These tests will pass regardless of the actual estimates returned by your solution.
At the end, a table will display your estimation accuracy across the different test configurations.
Note that the private test set will be more comprehensive than the tests provided to you.
If you'd like, you may further test your estimate_token_count method either by modifying test_part2.py or by using the scratchpad in scratchpad.py.