Embedding Model Server
Problem Overview
Design and implement a service that wraps an EmbeddingModel class to handle batch processing of text embeddings. The server should efficiently process batches of data and support graceful shutdown.
This problem tests your understanding of API design, batch processing, resource management, and error handling in production systems.
Key Requirements
- Batch Processing: Process multiple text inputs simultaneously
- Graceful Shutdown: Handle a special shutdown command to stop the server
- Result Mapping: Return embeddings with their corresponding IDs
- Scalability: Handle batches with up to 100 elements
- Error Handling: Gracefully handle failures and edge cases
Important Notes
- Start with a simple single-threaded implementation, then extend to multi-threaded if asked
- Focus on clean API design and proper resource management
- Consider production concerns like monitoring, error handling, and graceful shutdown
- Write your own test cases and ensure your code compiles and runs correctly
Part 1: Single-Threaded Implementation
Implement a basic server that processes requests sequentially.
Input/Output Format
Input (Batch Request):
```python
batch_data = [
    {'id': '1', 'data': 'example text 1'},
    {'id': '2', 'data': 'example text 2'},
]
```
Output (Batch Response):
```python
[
    {'id': '1', 'embedding': [0.1, 0.2, ...]},
    {'id': '2', 'embedding': [0.3, 0.4, ...]},
]
```
Shutdown:
```python
# Input
request = 'shutdown'
# Output
'Server shutdown complete.'
```
Example
```python
server = ModelServer()

# Process batch
result = server.handle_request([
    {'id': '1', 'data': 'machine learning'},
    {'id': '2', 'data': 'neural networks'}
])
# Returns: [{'id': '1', 'embedding': [...]}, {'id': '2', 'embedding': [...]}]

# Shutdown
result = server.handle_request('shutdown')
# Returns: 'Server shutdown complete.'
```
Requirements
- Parse and validate input requests
- Handle both batch processing and shutdown commands
- Maintain ID mapping between input and output
- Process items sequentially (no parallelization yet)
- Return results in a consistent format
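A minimal single-threaded sketch meeting these requirements could look like the following. The `EmbeddingModel` class is provided in the actual problem; here it is stubbed with a dummy fixed-dimension vector so the example runs on its own, and the `MAX_BATCH_SIZE` constant is an assumption to confirm with the interviewer.

```python
class EmbeddingModel:
    """Stand-in for the provided model; returns a dummy 4-dim vector."""
    def embed(self, text):
        return [float(len(text)), 0.0, 0.0, 0.0]


class ModelServer:
    MAX_BATCH_SIZE = 100  # assumed limit; confirm with the interviewer

    def __init__(self):
        self.model = EmbeddingModel()
        self.running = True

    def handle_request(self, request):
        # Shutdown is signaled by a special string request.
        if request == 'shutdown':
            self.running = False
            return 'Server shutdown complete.'
        if not self.running:
            raise RuntimeError('Server is shut down')
        if not isinstance(request, list):
            raise ValueError("Request must be a list of items or 'shutdown'")
        if len(request) > self.MAX_BATCH_SIZE:
            raise ValueError(f'Batch exceeds {self.MAX_BATCH_SIZE} items')
        results = []
        for item in request:
            # Sequential processing; preserves input order and the ID mapping.
            results.append({'id': item['id'],
                            'embedding': self.model.embed(item['data'])})
        return results
```

The shutdown check runs before validation so the sentinel string is never treated as a malformed batch.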
Clarification Questions to Ask
- Should we validate the batch size? What's the maximum?
- How should we handle individual item failures within a batch?
- What embedding dimension should we use (384, 768, 1536)?
- Should results maintain the same order as input?
- How should we handle duplicate IDs in a batch?
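One reasonable set of answers, stated as assumptions rather than requirements: cap batches at 100 items, reject duplicate IDs, and preserve input order. A hypothetical validation helper encoding the first two choices:

```python
def validate_batch(batch, max_size=100):
    """Reject oversized batches and duplicate IDs before any processing."""
    if len(batch) > max_size:
        raise ValueError(f'Batch size {len(batch)} exceeds limit of {max_size}')
    ids = [item['id'] for item in batch]
    if len(ids) != len(set(ids)):
        raise ValueError('Duplicate IDs in batch')
```

Validating up front keeps the failure atomic: a bad batch is rejected whole instead of partially processed.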
Part 2: Multi-Threaded Implementation (Follow-up)
Extend the solution to process batch items in parallel using a thread pool for better throughput.
Why Threading?
Performance Comparison:

| Implementation | Time for 100 items (10ms each) | Throughput |
|---|---|---|
| Single-threaded | ~1000ms (sequential) | 100 items/s |
| Multi-threaded (10 workers) | ~100ms (parallel) | 1000 items/s |
10x speedup with proper parallelization!
Example
```python
import time
from concurrent.futures import ThreadPoolExecutor

server = ThreadSafeModelServer(max_workers=10)

# Process a large batch in parallel
batch = [{'id': str(i), 'data': f'text {i}'} for i in range(100)]
start_time = time.time()
results = server.handle_request(batch)
elapsed = time.time() - start_time
print(f"Processed {len(results)} items in {elapsed:.2f}s")
# Output: Processed 100 items in ~0.10s (with 10 workers)
```
Requirements
- Use ThreadPoolExecutor for managing worker threads
- Submit all items to the thread pool concurrently
- Collect results as they complete using as_completed
- Handle individual item failures without crashing the entire batch
- Implement thread-safe shutdown with proper cleanup
- Validate the batch size limit (max 100 items)
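A sketch covering these requirements, again with `EmbeddingModel` stubbed so the example is self-contained. Note that with `as_completed` the results arrive in completion order, not input order; if the interviewer wants input order preserved, sort by ID or iterate the futures in submission order instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


class EmbeddingModel:
    """Stand-in for the provided model."""
    def embed(self, text):
        return [float(len(text))]


class ThreadSafeModelServer:
    MAX_BATCH_SIZE = 100  # assumed limit

    def __init__(self, max_workers=10):
        self.model = EmbeddingModel()
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self.lock = threading.Lock()
        self.running = True

    def _embed_item(self, item):
        return {'id': item['id'], 'embedding': self.model.embed(item['data'])}

    def handle_request(self, request):
        if request == 'shutdown':
            with self.lock:
                self.running = False
            # Wait for in-flight work to finish before reporting success.
            self.executor.shutdown(wait=True)
            return 'Server shutdown complete.'
        with self.lock:
            if not self.running:
                raise RuntimeError('Server is shut down')
        if len(request) > self.MAX_BATCH_SIZE:
            raise ValueError(f'Batch exceeds {self.MAX_BATCH_SIZE} items')
        # Submit every item concurrently; map each future back to its ID.
        futures = {self.executor.submit(self._embed_item, item): item['id']
                   for item in request}
        results = []
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # A failing item is reported per-ID, not fatal to the batch.
                results.append({'id': futures[future], 'error': str(exc)})
        return results
```

`ThreadPoolExecutor.shutdown(wait=True)` blocks until queued work drains, which gives the graceful-shutdown behavior; the lock only guards the `running` flag, since the executor itself is thread-safe.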