Design and implement a Dynamic Batch Inference Engine that efficiently processes multiple generation requests by batching them together. This is a simplified version of what production LLM (Large Language Model) serving systems use to handle many requests concurrently.
Input: requests = ["Hello", "World"], batch_size = 2, model = GPT-3
Output: ["Hello", "World"]
Input: requests = ["Hello", "World", "This", "is", "a", "test"], batch_size = 3, model = GPT-3
Output: ["Hello", "World", "This", "is", "a", "test"]
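The grouping behavior the examples describe can be sketched as a simple chunking helper (a minimal illustration; the function name `make_batches` is hypothetical, not part of the required API):

```python
def make_batches(requests, batch_size):
    # Split the request list into consecutive batches of at most batch_size items.
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

# With batch_size = 3, the six requests above form two batches:
# make_batches(["Hello", "World", "This", "is", "a", "test"], 3)
# → [["Hello", "World", "This"], ["is", "a", "test"]]
```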
```python
import queue
import threading


class DynamicBatchInferenceEngine:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.request_queue = queue.Queue()
        self.lock = threading.Lock()

    def add_request(self, request):
        # queue.Queue is already thread-safe; the lock makes the
        # submission step explicit for callers on multiple threads.
        with self.lock:
            self.request_queue.put(request)

    def process_requests(self):
        while True:
            # Drain up to batch_size requests into one batch.
            batch = []
            while not self.request_queue.empty() and len(batch) < self.batch_size:
                batch.append(self.request_queue.get())
            if batch:
                self._process_batch(batch)
            if self.request_queue.empty():
                break

    def _process_batch(self, batch):
        # Simulate model inference
        print(batch)


def main():
    engine = DynamicBatchInferenceEngine(batch_size=3)
    requests = ["Hello", "World", "This", "is", "a", "test"]
    for request in requests:
        engine.add_request(request)
    engine.process_requests()


if __name__ == "__main__":
    main()
```
This solution uses a thread-safe queue to collect incoming requests. process_requests drains the queue in chunks of at most batch_size and hands each chunk to _process_batch, which simulates model inference by printing the batch. The lock around add_request allows requests to be submitted safely from multiple producer threads.
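Since `queue.Queue` is thread-safe, several producer threads can submit requests concurrently before the engine drains them. A standalone sketch of that usage pattern (using the queue directly rather than the engine class, so it runs on its own):

```python
import queue
import threading

request_queue = queue.Queue()

def producer(requests):
    # Each producer thread enqueues its own requests.
    for r in requests:
        request_queue.put(r)

threads = [
    threading.Thread(target=producer, args=(["Hello", "World"],)),
    threading.Thread(target=producer, args=(["This", "is", "a", "test"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue into batches of at most 3, mirroring process_requests.
batch_size = 3
batches = []
while not request_queue.empty():
    batch = []
    while not request_queue.empty() and len(batch) < batch_size:
        batch.append(request_queue.get())
    batches.append(batch)
```

The relative order of items from different threads is nondeterministic, but every request ends up in exactly one batch of at most batch_size items.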