Design and implement a Data Batcher for ML training that samples mini-batches from multiple data sources according to user-specified weights, supports checkpointing so training can resume exactly where it left off, and wraps around when any source is exhausted.

The batcher is initialized with a mapping from source names to (weight, iterable) pairs and a batch_size. Requirements:

- Each call to get_batch() must return a list of exactly batch_size items drawn from the sources in proportion to their weights.
- When the proportional shares are not whole numbers, use the largest-remainder method: compute each source's exact fractional count, assign every source the floor of its count, then give the remaining slots to the sources with the largest fractional parts.
- A checkpoint(path) method must atomically write the current per-source offsets to disk, and a load_checkpoint(path) method must restore those offsets so the next batch continues exactly where the previous run stopped.
- If a source has fewer items remaining than its assigned count for a batch, take all of its remaining items, reset its offset to 0 (wrap-around), and make up the deficit from the same source after wrapping.
- The API must be thread-safe for single-producer use and must not preload data into memory; it should iterate on demand.
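The largest-remainder step can be illustrated on its own. The sketch below (function name `allocate` and its signature are my own choices, not part of the spec) floors each source's exact share, then hands the leftover slots to the sources whose fractional parts are largest:

```python
from math import floor

def allocate(weights, batch_size):
    """Largest-remainder apportionment of batch_size slots.

    weights: dict mapping source name -> positive weight.
    Returns a dict of integer counts summing to batch_size.
    """
    total = sum(weights.values())
    # Exact (fractional) share of the batch for each source.
    exact = {k: batch_size * w / total for k, w in weights.items()}
    counts = {k: floor(v) for k, v in exact.items()}
    leftover = batch_size - sum(counts.values())
    # Give each remaining slot to a source with a large fractional part.
    for k in sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts
```

For example, weights `{"a": 2, "b": 1}` with `batch_size=10` give exact shares 6.67 and 3.33; the floors are 6 and 3, and the one leftover slot goes to `"a"` (fractional part 0.67 > 0.33), yielding `{"a": 7, "b": 3}`. Ties among fractional parts are broken by dict insertion order here; the spec does not pin down a tie-break rule.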
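A minimal end-to-end sketch is below, under two assumptions the spec leaves open: each source is supplied as a zero-argument factory returning a fresh iterator (so the batcher can restart a source on wrap-around and skip to a checkpointed offset without preloading anything), and checkpoints are JSON files written via a temp file plus `os.replace` for atomicity. The class name, the `_allocate`/`_take` helpers, and the factory convention are all illustrative choices, not the required API surface:

```python
import itertools, json, os, tempfile, threading

_SENTINEL = object()

class DataBatcher:
    def __init__(self, sources, batch_size):
        # sources: dict name -> (weight, factory), factory() -> fresh iterator.
        self.batch_size = batch_size
        self.weights = {k: w for k, (w, _) in sources.items()}
        self.factories = {k: f for k, (_, f) in sources.items()}
        self.offsets = {k: 0 for k in sources}
        self.iters = {k: f() for k, f in self.factories.items()}
        self.counts = self._allocate()        # fixed per-batch quota
        self.lock = threading.Lock()          # single-producer safety

    def _allocate(self):
        # Largest-remainder method: floor the exact shares, then give
        # leftover slots to the largest fractional parts.
        total = sum(self.weights.values())
        exact = {k: self.batch_size * w / total for k, w in self.weights.items()}
        counts = {k: int(v) for k, v in exact.items()}
        leftover = self.batch_size - sum(counts.values())
        for k in sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)[:leftover]:
            counts[k] += 1
        return counts

    def _take(self, name, n):
        # Pull n items; on exhaustion, wrap around (restart the iterator,
        # reset the offset) and make up the deficit from the same source.
        # Assumes every source is non-empty.
        out = []
        while len(out) < n:
            item = next(self.iters[name], _SENTINEL)
            if item is _SENTINEL:
                self.iters[name] = self.factories[name]()
                self.offsets[name] = 0
                continue
            out.append(item)
            self.offsets[name] += 1
        return out

    def get_batch(self):
        with self.lock:
            batch = []
            for name, n in self.counts.items():
                batch.extend(self._take(name, n))
            return batch

    def checkpoint(self, path):
        # Write offsets to a temp file in the target directory, then
        # os.replace, which is atomic on POSIX filesystems.
        with self.lock:
            d = os.path.dirname(os.path.abspath(path))
            fd, tmp = tempfile.mkstemp(dir=d)
            with os.fdopen(fd, "w") as f:
                json.dump(self.offsets, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)

    def load_checkpoint(self, path):
        with self.lock:
            with open(path) as f:
                self.offsets = json.load(f)
            # Rebuild each iterator and skip past the consumed prefix
            # lazily (the itertools "consume" recipe), without preloading.
            for name, off in self.offsets.items():
                it = self.factories[name]()
                next(itertools.islice(it, off, off), None)
                self.iters[name] = it
```

One design note: storing plain integer offsets keeps checkpoints tiny, but resuming costs a linear skip through each source; if sources support random access, `load_checkpoint` could seek instead.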