Design Async Job System


[ INFO ] category: System Design | difficulty: hard | freq: high | first seen: 2026-01-13

[HARD][SYSTEM DESIGN][HIGH] data_engineering, Distributed Systems, web, machine_learning, System Design, Backend, Queue, mobile, infrastructure

$ cat problem.md

Design a distributed asynchronous job processing system that can handle millions of jobs per day across multiple data centers. The system must support different job types (e.g., email sending, report generation, ML model training, image processing) with varying priorities, resource requirements, and execution times ranging from milliseconds to hours.

Jobs are submitted via a REST API and mobile SDKs. Clients must be able to query real-time status and receive completion callbacks.

The system must guarantee at-least-once execution, support exactly-once semantics for idempotent jobs, and automatically retry failed jobs with exponential backoff. Workers are containerized services running in Kubernetes that scale horizontally based on queue depth and resource utilization. Design for 99.9% availability, handle worker failures gracefully, and ensure no job is lost even during deployments or region-wide outages.

Include support for job dependencies (DAGs), rate limiting per tenant, priority queues, and dead letter queues for permanently failed jobs. The solution should optimize for both throughput and latency while maintaining fairness across tenants.
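One way to make the retry and dead letter requirements concrete is a small retry policy that decides, after each failure, whether a job is rescheduled or dead-lettered. The sketch below is illustrative only: the names `RetryPolicy`, `next_delay`, and `route_failed_job`, and the default values (1 s base delay, 300 s cap, 5 attempts), are assumptions and not part of the problem statement. A production design would usually also add random jitter to the delay so that many failed jobs do not retry at the same instant.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RetryPolicy:
    # All defaults are hypothetical; tune them per job type.
    base_delay_s: float = 1.0    # delay before the first retry
    max_delay_s: float = 300.0   # cap so delays do not grow without bound
    max_attempts: int = 5        # total executions allowed per job

    def next_delay(self, failures: int) -> Optional[float]:
        """Return the delay before the next retry, given how many times
        the job has failed so far, or None once attempts are exhausted."""
        if failures >= self.max_attempts:
            return None
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay_s.
        return min(self.max_delay_s, self.base_delay_s * 2 ** (failures - 1))


def route_failed_job(job_id: str, failures: int,
                     policy: RetryPolicy) -> Tuple[str, str, Optional[float]]:
    """Decide where a failed job goes: back to the queue with a delay,
    or to the dead letter queue when retries are exhausted."""
    delay = policy.next_delay(failures)
    if delay is None:
        return ("dead_letter", job_id, None)
    return ("retry", job_id, delay)
```

With the assumed defaults, a job is retried after 1, 2, 4, and 8 seconds and is routed to the dead letter queue after its fifth failure. Because delivery is at-least-once, the same job may run more than once, so handlers still need to be idempotent.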

user@intervues:~/salesforce$