Practice/xAI/Dynamic Batch Inference
Coding · Must
Design and implement a Dynamic Batch Inference Engine that efficiently processes multiple generation requests by batching them together. This is a simplified version of what production LLM inference engines (like vLLM, TensorRT-LLM, or TGI) do to serve models like Grok.
You are given a simulated language model interface that generates the next token for a batch of sequences. Your task is to implement a BatchInferenceEngine that:
- accepts generation requests via submit_request
- runs up to batch_size requests in parallel as a single batch
- finishes each request when it reaches its max_tokens limit or the model generates stop_token
- invokes the request's callback with the completed sequence

```python
model = SimulatedLLM()
engine = BatchInferenceEngine(model, batch_size=4, stop_token=0)

results = []
engine.submit_request([1, 2, 3], max_tokens=5, callback=lambda seq: results.append(seq))
engine.submit_request([10, 20], max_tokens=3, callback=lambda seq: results.append(seq))
engine.submit_request([100], max_tokens=10, callback=lambda seq: results.append(seq))

engine.run()
```
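The problem does not spell out the SimulatedLLM interface beyond "generates the next token for a batch of sequences," so the sketches in this write-up assume a hypothetical next_tokens(batch) method; the deterministic rule inside it is invented purely so the engine has something to call.

```python
from typing import List

class SimulatedLLM:
    """Stand-in for a real model: maps a batch of token sequences
    to one next token per sequence. (Assumed interface.)"""

    def next_tokens(self, batch: List[List[int]]) -> List[int]:
        # Hypothetical deterministic rule: next token is the sequence
        # sum mod 7, so the stop token 0 shows up occasionally.
        return [sum(seq) % 7 for seq in batch]
```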
A clarifying question worth asking: can new requests be submitted while run() is executing, or only before?

Part 1: Implement the core batching logic without dynamic slot filling. All requests submitted before run() are processed together.
```python
model = SimulatedLLM()
engine = BatchInferenceEngine(model, batch_size=2, stop_token=0)

results = []
engine.submit_request([1, 2], max_tokens=3, callback=lambda s: results.append(s))
engine.submit_request([10, 20], max_tokens=3, callback=lambda s: results.append(s))
engine.run()
```
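A minimal sketch of the Part 1 engine, assuming the next_tokens(batch) model interface above: requests are drained in chunks of at most batch_size, each chunk is decoded step by step, and finished sequences simply drop out of the batch; their slots are never refilled.

```python
from typing import Callable, List

class BatchInferenceEngine:
    """Part 1: static batching. Everything submitted before run()
    is processed in fixed chunks of at most batch_size requests."""

    def __init__(self, model, batch_size: int, stop_token: int):
        self.model = model
        self.batch_size = batch_size
        self.stop_token = stop_token
        self.queue = []  # pending [sequence, max_tokens, callback] triples

    def submit_request(self, prompt: List[int], max_tokens: int,
                       callback: Callable[[List[int]], None]) -> None:
        # Copy the prompt so generation never mutates the caller's list.
        self.queue.append([list(prompt), max_tokens, callback])

    def run(self) -> None:
        while self.queue:
            active, self.queue = self.queue[:self.batch_size], self.queue[self.batch_size:]
            remaining = [req[1] for req in active]  # tokens still allowed per request
            done = [False] * len(active)
            while not all(done):
                idxs = [i for i, d in enumerate(done) if not d]
                next_toks = self.model.next_tokens([active[i][0] for i in idxs])
                for i, tok in zip(idxs, next_toks):
                    seq, _, cb = active[i]
                    seq.append(tok)
                    remaining[i] -= 1
                    if tok == self.stop_token or remaining[i] == 0:
                        done[i] = True
                        cb(seq)  # deliver prompt + generated tokens
```

Note the inefficiency this leaves on the table: a chunk runs until its slowest request finishes, which is exactly what Part 2 addresses.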
Part 2: Extend the implementation to fill empty slots with waiting requests as sequences complete. The engine still processes up to batch_size requests in parallel, and a sequence still finishes when it reaches max_tokens or generates stop_token, but now a finished sequence's slot is immediately handed to the next waiting request.
```python
engine = BatchInferenceEngine(model, batch_size=2, stop_token=0)

engine.submit_request([1], max_tokens=1, callback=...)  # Finishes after 1 step
engine.submit_request([2], max_tokens=3, callback=...)  # Finishes after 3 steps
engine.submit_request([3], max_tokens=2, callback=...)  # Queued initially
engine.submit_request([4], max_tokens=1, callback=...)  # Queued initially

engine.run()
```
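One way to sketch Part 2, reusing the class above: keep a waiting queue and refill freed slots at the top of every decode step, so a request like [3] enters the batch as soon as [1] finishes. The subclass name and slot layout are my own choices, not part of the problem.

```python
from collections import deque

class DynamicBatchInferenceEngine(BatchInferenceEngine):
    """Part 2: continuous batching. A finished sequence frees its
    slot, which is refilled from the waiting queue on the next step."""

    def run(self) -> None:
        waiting = deque(self.queue)
        self.queue = []
        slots = []  # active [sequence, tokens_left, callback] triples
        while waiting or slots:
            # Refill empty slots before each decode step.
            while waiting and len(slots) < self.batch_size:
                slots.append(waiting.popleft())
            # One decode step for every active sequence.
            next_toks = self.model.next_tokens([s[0] for s in slots])
            still_active = []
            for (seq, left, cb), tok in zip(slots, next_toks):
                seq.append(tok)
                left -= 1
                if tok == self.stop_token or left == 0:
                    cb(seq)  # finished: slot frees up this step
                else:
                    still_active.append([seq, left, cb])
            slots = still_active
```

On the example above (ignoring early stop_token hits), step 1 decodes [1] and [2]; [1] hits max_tokens=1, so its slot goes to [3] for step 2 and later to [4], finishing the whole workload in 4 steps where the static two-chunk schedule needs 5.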