Design a Delayed Payment Scheduler Service

System Design · Must

Problem Statement

Design a system that allows users to schedule payments of virtual currency to be executed at a future time. In a gaming platform like Roblox, users might schedule Robux transfers for birthdays, recurring allowances, or timed rewards. The scheduler must execute each payment within a small tolerance of its scheduled time, handle failures gracefully with retries, and guarantee that no payment is executed twice or missed entirely.

This problem combines two difficult distributed systems challenges: reliable time-based job scheduling at massive scale and idempotent financial transaction processing. Unlike simple cron jobs, this system must handle millions of independently scheduled payments with different execution times, survive server restarts and data center failures without losing any scheduled tasks, and provide exactly-once execution semantics for operations that move real money (or virtual currency with real-world value). Interviewers use this question to evaluate your understanding of durable task queues, state machine design, payment idempotency, and operational concerns like monitoring and manual intervention.

The real-world complexity comes from edge cases: what happens when a payment is due but the recipient's account is locked? What if the sender's balance changed between scheduling and execution? How do you handle timezone-sensitive scheduling and clock skew across distributed nodes?

Key Requirements

Functional

Schedule future payments -- users specify an amount, recipient, and execution time; the system persists and acknowledges the request immediately
Execute payments at the scheduled time -- the system triggers payment processing within a small tolerance window (under 60 seconds) of the target time
Retry on failure -- transient failures (network timeouts, temporary service unavailability) trigger automatic retries with exponential backoff
Cancel or modify scheduled payments -- users can cancel pending payments or adjust the amount and time before execution
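
The retry requirement can be illustrated with a minimal sketch of exponential backoff. The function name `backoff_delay`, the 2-second base, the 300-second cap, and the use of full jitter are all assumptions made for illustration; the problem statement only asks for "exponential backoff".

```python
import random


def backoff_delay(attempt, base=2.0, cap=300.0, rng=random.random):
    """Return the delay in seconds before retry number `attempt` (1-based).

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...) up to
    `cap`. The actual delay is drawn uniformly from [0, ceiling) ("full
    jitter") so that many failed payments do not retry at the same instant.
    `rng` is injectable so the function can be tested deterministically.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng() * ceiling
```

With the assumed defaults, the ceilings for attempts 1 through 4 are 2, 4, 8, and 16 seconds. The cap keeps late retries from drifting arbitrarily far from the scheduled time.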

Non-Functional

Scalability -- support 50 million scheduled payments with 500,000 executions per day, peaking at 30,000 per minute
Reliability -- guarantee at-least-once execution with no missed payments; tolerate node failures, restarts, and network partitions
Latency -- payment creation acknowledged within 200ms; scheduled payments execute within 60 seconds of their target time
Consistency -- ensure effectively exactly-once payment execution by pairing at-least-once delivery with idempotent processing; prevent double-charges even under retries and failovers
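
A quick back-of-the-envelope calculation turns the scalability numbers above into per-second rates. The variable names are illustrative; the figures come straight from the requirements.

```python
EXECUTIONS_PER_DAY = 500_000
PEAK_PER_MINUTE = 30_000
SECONDS_PER_DAY = 86_400

avg_per_second = EXECUTIONS_PER_DAY / SECONDS_PER_DAY   # about 5.8
peak_per_second = PEAK_PER_MINUTE / 60                  # 500
peak_to_avg = peak_per_second / avg_per_second          # about 86

print(
    f"average: {avg_per_second:.1f}/s, "
    f"peak: {peak_per_second:.0f}/s, "
    f"ratio: {peak_to_avg:.0f}x"
)
# prints "average: 5.8/s, peak: 500/s, ratio: 86x"
```

The peak is roughly 86 times the average, which suggests the design should be sized for bursts rather than for steady-state load.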

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Durable Time-Based Job Scheduling

You need a mechanism to reliably execute millions of jobs at specific future timestamps without losing tasks during crashes or restarts. This is the heart of the problem.

Hints to consider:

Consider time-partitioned database indexes or sorted data structures that workers poll by time window rather than scanning all records
Evaluate push-based (timers fire callbacks) versus pull-based (workers poll for due jobs) models and their failure characteristics
Think about how to shard scheduled payments across worker nodes to avoid hot spots while ensuring coverage
Plan for clock skew between distributed nodes when comparing current time against scheduled execution time
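
The pull-based model from the hints above can be sketched as follows. This is a minimal in-memory illustration, not a durable implementation: the heap stands in for a database index on execution time, and the names `DueQueue`, `schedule`, and `pop_due` are hypothetical.

```python
import heapq


class DueQueue:
    """In-memory stand-in for a time-ordered index of scheduled payments.

    A real scheduler would keep this ordering in durable storage (for
    example an index on the execution timestamp) so that a restart loses
    nothing. The heap only illustrates the access pattern.
    """

    def __init__(self):
        self._heap = []  # entries are (execute_at, payment_id)

    def schedule(self, payment_id, execute_at):
        heapq.heappush(self._heap, (execute_at, payment_id))

    def pop_due(self, now, limit=100):
        """Return up to `limit` payment IDs whose time is <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now and len(due) < limit:
            due.append(heapq.heappop(self._heap)[1])
        return due


queue = DueQueue()
queue.schedule("pay-a", execute_at=10)
queue.schedule("pay-b", execute_at=20)
queue.schedule("pay-c", execute_at=5)

first_poll = queue.pop_due(now=10)   # ["pay-c", "pay-a"]
second_poll = queue.pop_due(now=10)  # []
third_poll = queue.pop_due(now=25)   # ["pay-b"]
```

Each poll touches only the entries that are already due, so its cost depends on the number of due payments rather than on the total number scheduled. Passing `now` in explicitly also makes it easy to reason about which clock is being trusted when nodes disagree.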

2. Payment Idempotency and State Machine Design

Retries and failovers are inevitable in distributed systems. You must prevent duplicate payment execution while ensuring every scheduled payment eventually completes.

Hints to consider:

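Model each payment as a state machine with atomic state transitions: SCHEDULED to PROCESSING, then PROCESSING to COMPLETED on success, to RETRYING on a retryable error (which later returns to PROCESSING), or to FAILED on a terminal error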
Use an idempotency key derived from the payment ID and schedule cycle to make retry-safe calls to the payment processing service
Distinguish between retryable errors (network timeout, rate limit) and terminal failures (insufficient balance, account closed)
Design compensation logic for cases where the debit succeeds but the credit fails, leaving the system in an inconsistent state
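
One way to make state transitions atomic is a compare-and-set update: the row changes only if it is still in the expected state. The sketch below uses SQLite purely for illustration. The `payments` table, the transition map, the `transition` helper, and the `payment_id:cycle` key format are all assumptions, not a prescribed schema.

```python
import sqlite3

# Assumed transition map; terminal states have no outgoing edges.
ALLOWED = {
    "SCHEDULED": {"PROCESSING"},
    "PROCESSING": {"COMPLETED", "RETRYING", "FAILED"},
    "RETRYING": {"PROCESSING", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}


def idempotency_key(payment_id, cycle):
    """Same payment and cycle always yield the same key, so retries dedupe."""
    return f"{payment_id}:{cycle}"


def transition(conn, payment_id, expected, new):
    """Move a payment from `expected` to `new`; return True if this caller won."""
    if new not in ALLOWED[expected]:
        raise ValueError(f"illegal transition {expected} -> {new}")
    cur = conn.execute(
        "UPDATE payments SET state = ? WHERE id = ? AND state = ?",
        (new, payment_id, expected),
    )
    conn.commit()
    # rowcount is 0 when another worker already changed the state.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
conn.execute("INSERT INTO payments VALUES ('pay-1', 'SCHEDULED')")
conn.commit()

first = transition(conn, "pay-1", "SCHEDULED", "PROCESSING")   # True
second = transition(conn, "pay-1", "SCHEDULED", "PROCESSING")  # False
```

The second call fails because the row is no longer in SCHEDULED, which is how two workers racing for the same payment are kept from both processing it. The idempotency key would then accompany the call to the payment service so that a retried request for the same cycle is recognized as a duplicate.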

3. Worker Coordination and Partition Assignment

With millions of scheduled payments, multiple workers must process due jobs in parallel without duplicating work or leaving gaps.

Hints to consider:

Use consistent hashing or range-based partitioning to assign time windows or payment ID ranges to specific workers
Implement lease-based ownership where a worker claims a partition for a fixed duration and must renew the lease periodically
Design for worker failures: if a worker dies mid-processing, another worker should detect the stale lease and take over within seconds
Consider using a coordination service like ZooKeeper or etcd for leader election and partition assignment
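
Lease-based ownership can be sketched with a small in-memory table. The class `LeaseTable`, its `acquire` method, and the 10-second TTL are hypothetical. In a real deployment the lease record would live in a coordination service such as ZooKeeper or etcd and be updated atomically, which this single-process sketch does not attempt.

```python
class LeaseTable:
    """Tracks which worker owns each partition and until when."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._leases = {}  # partition -> (owner, expires_at)

    def acquire(self, partition, worker, now):
        """Claim or renew a lease; return True if `worker` holds it afterwards.

        A lease is granted when the partition is unowned, when the caller
        already owns it (renewal), or when the previous lease has expired.
        """
        owner, expires_at = self._leases.get(partition, (None, 0.0))
        if owner is None or owner == worker or expires_at <= now:
            self._leases[partition] = (worker, now + self.ttl)
            return True
        return False


leases = LeaseTable(ttl=10)
claimed = leases.acquire("partition-0", "worker-1", now=0)     # True
contested = leases.acquire("partition-0", "worker-2", now=5)   # False, lease still live
renewed = leases.acquire("partition-0", "worker-1", now=8)     # True, now expires at 18
takeover = leases.acquire("partition-0", "worker-2", now=18)   # True, lease expired
```

The TTL bounds how long a partition can sit idle after its owner dies: a shorter TTL means faster takeover but more renewal traffic.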

4. Handling Edge Cases and Operational Concerns

Real systems face many edge cases that naive designs overlook. Interviewers probe whether you think beyond the happy path.

Hints to consider:

Handle "thundering herd" scenarios where many payments share the same execution time (e.g., midnight resets or holiday events)
Design admin tools for pausing, resuming, or force-executing individual payments when automation fails
Implement dead-letter queues for payments that exhaust all retry attempts, with alerting and manual review workflows
Emit operational metrics: execution lag (time between scheduled and actual execution), retry counts, success rates, and queue depth
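
One possible mitigation for the thundering-herd case is to spread payments that share a target time across a short window that stays inside the execution tolerance. The sketch below derives a stable per-payment offset from a hash of the payment ID. The function name `spread_offset` and the 30-second window are assumptions; whether any added delay is acceptable is a product decision to confirm with the interviewer.

```python
import hashlib


def spread_offset(payment_id, window_seconds=30):
    """Return a deterministic offset in [0, window_seconds) for a payment.

    Hashing the ID (rather than drawing a random number) means the same
    payment always gets the same offset, even after a worker restart.
    """
    digest = hashlib.sha256(payment_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


# Example: 3,000 payments all due at the same instant.
offsets = [spread_offset(f"pay-{i}") for i in range(3000)]
```

With a 30-second window, a burst of payments due at midnight is processed over about 30 seconds instead of all at once, while each one still executes within the 60-second tolerance stated in the requirements.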

Suggested Approach

Step 1: Clarify Requirements

Confirm the scope with the interviewer. Ask about the volume of scheduled payments and their time distribution (are they spread evenly or clustered around specific times?). Clarify whether payments are one-time or recurring, and if recurring, how the schedule is defined. Ask about the payment processing system: is it an external API with rate limits, or an internal service you control? Clarify the tolerance for execution delay (is 30 seconds acceptable, or must it be under 5 seconds?). Understand the failure model: what happens if the sender's balance is insufficient at execution time? Finally, ask about cancellation semantics: can a payment be cancelled while it is actively being processed?
